Abstract

Regularization Networks and Support Vector Machines are techniques for solving certain
problems of learning from examples, in particular the regression problem of approximating
a multivariate function from sparse data. We present both formulations in a unified
framework, namely in the context of Vapnik's theory of statistical learning which provides 
a general foundation for the learning problem, combining functional analysis and statistics. 
1 Introduction

The purpose of this paper is to present a theoretical framework for the problem of learning from
examples. Learning from examples can be regarded as the regression problem of approximating
a multivariate function from sparse data, and we will take this point of view here¹. The problem
of approximating a function from sparse data is ill-posed, and a classical way to solve it is
regularization theory [92, 10, 11]. Classical regularization theory, as we will consider it here²,
formulates the regression problem as a variational problem of finding the function $f$ that
minimizes the functional
$$\min_{f \in \mathcal{H}} H[f] = \frac{1}{l} \sum_{i=1}^{l} (y_i - f(\mathbf{x}_i))^2 + \lambda \|f\|_K^2 \qquad (1)$$

where $\|f\|_K^2$ is a norm in a Reproducing Kernel Hilbert Space $\mathcal{H}$ defined by the positive definite
function $K$, $l$ is the number of data points or examples (the $l$ pairs $(\mathbf{x}_i, y_i)$) and $\lambda$ is the
regularization parameter (see the seminal work of [102]). Under rather general conditions the solution
of equation (1) is

$$f(\mathbf{x}) = \sum_{i=1}^{l} c_i K(\mathbf{x}, \mathbf{x}_i). \qquad (2)$$
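As a minimal sketch of how the coefficients $c_i$ in (2) can be obtained for the square loss: writing $f$ in the form (2) and setting the gradient of $\frac{1}{l}\sum_i (y_i - f(\mathbf{x}_i))^2 + \lambda \|f\|_K^2$ with respect to the coefficient vector to zero gives the linear system $(K + \lambda l I)\mathbf{c} = \mathbf{y}$, where $K$ here denotes the matrix with entries $K(\mathbf{x}_i, \mathbf{x}_j)$. The Gaussian kernel, its width `sigma`, the synthetic data and all function names below are illustrative assumptions, not choices made in the text.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """Kernel matrix K(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2)) (assumed kernel)."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def fit_regularization_network(x, y, lam, sigma=1.0):
    """Coefficients c of f(x) = sum_i c_i K(x, x_i) for the square loss."""
    l = x.shape[0]
    k = gaussian_kernel(x, x, sigma)
    # Stationarity of (1/l) * sum (y_i - f(x_i))^2 + lam * ||f||_K^2
    # in the coefficients gives (K + lam * l * I) c = y.
    return np.linalg.solve(k + lam * l * np.eye(l), y)

def predict(x_train, c, x_new, sigma=1.0):
    """Evaluate f at the rows of x_new."""
    return gaussian_kernel(x_new, x_train, sigma) @ c

# Synthetic one-dimensional regression data (hypothetical example).
rng = np.random.default_rng(0)
x_train = rng.uniform(-3.0, 3.0, size=(30, 1))
y_train = np.sin(x_train[:, 0]) + 0.1 * rng.standard_normal(30)

c = fit_regularization_network(x_train, y_train, lam=1e-3)
y_fit = predict(x_train, c, x_train)
print("training mean squared error:", np.mean((y_fit - y_train) ** 2))
```

Larger values of `lam` shrink the coefficients and smooth the fitted function, which is the capacity-control role of $\lambda$ discussed in the text.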

Until now the functionals of classical regularization have lacked a rigorous justification for a finite 
set of training data. Their formulation is based on functional analysis arguments which rely on 
asymptotic results and do not consider finite data sets³. Regularization is the approach we have
taken in earlier work on learning [69, 39, 77]. The seminal work of Vapnik [94, 95, 96] has now 
set the foundations for a more general theory that justifies regularization functionals for learning 
from finite sets and can be used to extend considerably the classical framework of regularization, 
effectively marrying a functional analysis perspective with modern advances in the theory of 
probability and statistics. The basic idea of Vapnik's theory is closely related to regularization: 
for a finite set of training examples the search for the best model or approximating function has 
to be constrained to an appropriately "small" hypothesis space (which can also be thought of as 
a space of machines or models or network architectures). If the space is too large, models can 
be found which will fit exactly the data but will have a poor generalization performance, that 
is poor predictive capability on new data. Vapnik's theory characterizes and formalizes these 
concepts in terms of the capacity of a set of functions and capacity control depending on the 
training data: for instance, for a small training set the capacity of the function space in which 
$f$ is sought has to be small whereas it can increase with a larger training set. As we will see
later in the case of regularization, a form of capacity control leads to choosing an optimal $\lambda$ in
equation (1) for a given set of data. A key part of the theory is to define and bound the capacity 
of a set of functions. 

Thus the key and somewhat novel theme of this review is a) to describe a unified framework for 
several learning techniques for finite training sets and b) to justify them in terms of statistical 
learning theory. We will consider functionals of the form 



¹ There is a large literature on the subject: useful reviews are [44, 19, 102, 39], [96] and references therein.

² The general regularization scheme for learning is sketched in Appendix A.

³ The method of quasi-solutions of Ivanov and the equivalent Tikhonov regularization technique were
developed to solve ill-posed problems of the type $Af = F$, where $A$ is a (linear) operator, $f$ is the desired solution in
a metric space $E_1$, and $F$ are the "data" in a metric space $E_2$.



$$H[f] = \frac{1}{l} \sum_{i=1}^{l} V(y_i, f(\mathbf{x}_i)) + \lambda \|f\|_K^2 \qquad (3)$$

where $V(\cdot, \cdot)$ is a loss function. We will describe how classical regularization and Support Vector
Machines [96] for both regression (SVMR) and classification (SVMC) correspond to the
minimization of $H$ in equation (3) for different choices of $V$:

• Classical ($L_2$) Regularization Networks (RN)

$$V(y_i, f(\mathbf{x}_i)) = (y_i - f(\mathbf{x}_i))^2 \qquad (4)$$

• Support Vector Machines Regression (SVMR)

$$V(y_i, f(\mathbf{x}_i)) = |y_i - f(\mathbf{x}_i)|_\epsilon \qquad (5)$$

• Support Vector Machines Classification (SVMC)

$$V(y_i, f(\mathbf{x}_i)) = |1 - y_i f(\mathbf{x}_i)|_+ \qquad (6)$$

where $|\cdot|_\epsilon$ is Vapnik's epsilon-insensitive norm (see later), $|x|_+ = x$ if $x$ is positive and zero
otherwise, and $y_i$ is a real number in RN and SVMR, whereas it takes values $-1, 1$ in SVMC.
Loss function (6) is also called the soft margin loss function. For SVMC, we will also discuss two
other loss functions:

• The hard margin loss function:

$$V(y_i, f(\mathbf{x}_i)) = \theta(1 - y_i f(\mathbf{x}_i)) \qquad (7)$$

• The misclassification loss function:

$$V(y_i, f(\mathbf{x}_i)) = \theta(-y_i f(\mathbf{x}_i)) \qquad (8)$$

where $\theta(\cdot)$ is the Heaviside function. For classification one should minimize (8) (or (7)), but in
practice other loss functions, such as the soft margin one (6) [22, 95], are used. We discuss this
issue further in section 6.
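The five loss functions (4) to (8) can be written out as a short sketch. Two details are assumptions made only for illustration: the epsilon-insensitive norm is taken to be $|x|_\epsilon = \max(|x| - \epsilon, 0)$ (the text defers its definition), and the Heaviside function is taken to be 0 at the origin. The function names are hypothetical.

```python
def square_loss(y, fx):
    """Loss (4): classical L2 regularization networks."""
    return (y - fx) ** 2

def eps_insensitive_loss(y, fx, eps=0.1):
    """Loss (5): errors smaller than eps cost nothing (assumed definition)."""
    return max(abs(y - fx) - eps, 0.0)

def soft_margin_loss(y, fx):
    """Loss (6): |1 - y f(x)|_+ with y in {-1, +1}."""
    return max(1.0 - y * fx, 0.0)

def heaviside(t):
    """theta(t), with the convention theta(0) = 0 (an assumption)."""
    return 1.0 if t > 0 else 0.0

def hard_margin_loss(y, fx):
    """Loss (7): penalizes every point with margin y f(x) below 1."""
    return heaviside(1.0 - y * fx)

def misclassification_loss(y, fx):
    """Loss (8): 1 when the sign of f(x) disagrees with y."""
    return heaviside(-y * fx)

# A correctly classified point that lies inside the margin.
y, fx = 1, 0.5
print(soft_margin_loss(y, fx), hard_margin_loss(y, fx), misclassification_loss(y, fx))
```

The example point is classified correctly, so loss (8) vanishes, while losses (6) and (7) still penalize it for lying inside the margin.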

The minimizer of (3) using the three loss functions has the same general form (2) (or $f(\mathbf{x}) =
\sum_{i=1}^{l} c_i K(\mathbf{x}, \mathbf{x}_i) + b$, see later) but interestingly different properties⁴. In this review we will show
how different learning techniques based on the minimization of functionals of the form of $H$ in
(3) can be justified for a few choices of $V(\cdot, \cdot)$ using a slight extension of the tools and results
of Vapnik's statistical learning theory. In section 2 we outline the main results in the theory of 
statistical learning and in particular Structural Risk Minimization - the technique suggested by 
Vapnik to solve the problem of capacity control in learning from "small" training sets. At the 
end of the section we will outline a technical extension of Vapnik's Structural Risk Minimization 
framework (SRM). With this extension both RN and Support Vector Machines (SVMs) can be 
seen within an SRM scheme. In recent years a number of papers have claimed that SVM cannot be



⁴ For general differentiable loss functions $V$ the form of the solution is still the same, as shown in Appendix C.



justified in a data-independent SRM framework (e.g. [86]). One of the goals of this paper is
to provide such a data-independent SRM framework that justifies SVM as well as RN. Before 
describing regularization techniques, section 3 reviews some basic facts on RKHS which are 
the main function spaces on which this review is focused. After the section on regularization 
(section 4) we will describe SVMs (section 5). As we saw already, SVMs for regression can be 
considered as a modification of regularization formulations of the type of equation (1). Radial 
Basis Functions (RBF) can be shown to be solutions in both cases (for radial $K$) but with a
rather different structure of the coefficients $c_i$.

Section 6 describes in more detail how and why both RN and SVM can be justified in terms of 
SRM, in the sense of Vapnik's theory: the key to capacity control is how to choose $\lambda$ for a given
set of data. Section 7 describes a naive Bayesian Maximum A Posteriori (MAP) interpretation 
of RNs and of SVMs. It also shows why a formal MAP interpretation, though interesting and 
even useful, may be somewhat misleading. Section 8 discusses relations of the regularization and 
SVM techniques with other representations of functions and signals such as sparse representations 
from overcomplete dictionaries, Blind Source Separation, and Independent Component Analysis. 
Finally, section 9 summarizes the main themes of the review and discusses some of the open 
problems. 

2 Overview of statistical learning theory 

We consider the case of learning from examples as defined in the statistical learning theory
framework [94, 95, 96]. We have two sets of variables $\mathbf{x} \in X \subseteq R^d$ and $y \in Y \subseteq R$ that are
related by a probabilistic relationship. We say that the relationship is probabilistic because
generally an element of $X$ does not determine uniquely an element of $Y$, but rather a probability
distribution on $Y$. This can be formalized assuming that a probability distribution $P(\mathbf{x}, y)$ is
defined over the set $X \times Y$. The probability distribution $P(\mathbf{x}, y)$ is unknown, and under very
general conditions can be written as $P(\mathbf{x}, y) = P(\mathbf{x}) P(y|\mathbf{x})$, where $P(y|\mathbf{x})$ is the conditional
probability of $y$ given $\mathbf{x}$, and $P(\mathbf{x})$ is the marginal probability of $\mathbf{x}$. We are provided with
examples of this probabilistic relationship, that is with a data set $D_l \equiv \{(\mathbf{x}_i, y_i) \in X \times Y\}_{i=1}^{l}$
called the training data, obtained by sampling $l$ times the set $X \times Y$ according to $P(\mathbf{x}, y)$. The
problem of learning consists in, given the data set $D_l$, providing an estimator, that is a function
$f : X \to Y$, that can be used, given any value of $\mathbf{x} \in X$, to predict a value $y$.
In statistical learning theory, the standard way to solve the learning problem consists in defining
a risk functional, which measures the average amount of error associated with an estimator, and
then looking for the estimator, among the allowed ones, with the lowest risk. If $V(y, f(\mathbf{x}))$ is the
loss function measuring the error we make when we predict $y$ by $f(\mathbf{x})$⁵, then the average error is
the so-called expected risk:

$$I[f] \equiv \int_{X,Y} V(y, f(\mathbf{x})) P(\mathbf{x}, y)\, d\mathbf{x}\, dy \qquad (9)$$

We assume that the expected risk is defined on a "large" class of functions $\mathcal{F}$ and we will denote
by $f_0$ the function which minimizes the expected risk in $\mathcal{F}$:

$$f_0(\mathbf{x}) = \arg\min_{\mathcal{F}} I[f] \qquad (10)$$



⁵ Typically for regression the loss function is of the form $V(y - f(\mathbf{x}))$.



The function $f_0$ is our ideal estimator, and it is often called the target function⁶.
Unfortunately this function cannot be found in practice, because the probability distribution
$P(\mathbf{x}, y)$ that defines the expected risk is unknown, and only a sample of it, the data set $D_l$, is
available. To overcome this shortcoming we need an induction principle that we can use to "learn"
from the limited number of training data we have. Statistical learning theory as developed by
Vapnik builds on the so-called empirical risk minimization (ERM) induction principle. The ERM
method consists in using the data set $D_l$ to build a stochastic approximation of the expected
risk, which is usually called the empirical risk, and is defined as⁷:

$$I_{emp}[f; l] = \frac{1}{l} \sum_{i=1}^{l} V(y_i, f(\mathbf{x}_i)). \qquad (11)$$
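A small simulation can make the relation between the empirical risk and the expected risk concrete. The sketch below assumes a hypothetical distribution $P(\mathbf{x}, y)$ ($x$ uniform on $[0, 1]$, $y = 2x$ plus Gaussian noise of standard deviation 0.1) and the square loss, so that the expected risk of the function $f(x) = 2x$ equals the noise variance 0.01. None of these choices come from the text.

```python
import random

def square_loss(y, fx):
    return (y - fx) ** 2

def empirical_risk(loss, f, data):
    """I_emp[f; l]: average loss of f over the l training pairs in data."""
    return sum(loss(y, f(x)) for x, y in data) / len(data)

def draw_sample(l, rng, noise_sd=0.1):
    """Sample l pairs from the assumed P(x, y): x ~ U[0, 1], y = 2x + noise."""
    sample = []
    for _ in range(l):
        x = rng.random()
        sample.append((x, 2.0 * x + rng.gauss(0.0, noise_sd)))
    return sample

def target(x):
    """Regression function of the assumed distribution."""
    return 2.0 * x

rng = random.Random(0)
for l in (10, 100, 10000):
    # For a fixed function, the empirical risk approaches the expected risk 0.01.
    print(l, empirical_risk(square_loss, target, draw_sample(l, rng)))
```

This only illustrates convergence for a single fixed function. The theory summarized in this section concerns the harder question of convergence uniformly over a whole class of functions.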

The central question of the theory is whether the expected risk of the minimizer of the empirical
risk in $\mathcal{F}$ is close to the expected risk of $f_0$. Notice that the question is not necessarily whether
we can find $f_0$ but whether we can "imitate" $f_0$ in the sense that the expected risk of our solution
is close to that of $f_0$. Formally the theory answers the question of finding under which conditions
the method of ERM satisfies:

$$\lim_{l \to \infty} I_{emp}[\hat{f}_l; l] = \lim_{l \to \infty} I[\hat{f}_l] = I[f_0] \qquad (12)$$

in probability (all statements are probabilistic since we start with $P(\mathbf{x}, y)$ on the data), where
we denote by $\hat{f}_l$ the minimizer of the empirical risk (11) in $\mathcal{F}$.

It can be shown (see for example [96]) that in order for the limits in eq. (12) to hold true in
probability, or more precisely, for the empirical risk minimization principle to be non-trivially
consistent (see [96] for a discussion about consistency versus non-trivial consistency), the
following uniform law of large numbers (which "translates" to one-sided uniform convergence in
probability of empirical risk to expected risk in $\mathcal{F}$) is a necessary and sufficient condition:

$$\lim_{l \to \infty} P\left\{ \sup_{f \in \mathcal{F}} \left( I[f] - I_{emp}[f; l] \right) > \epsilon \right\} = 0 \quad \forall \epsilon > 0 \qquad (13)$$

Intuitively, if $\mathcal{F}$ is very "large" then we can always find $\hat{f}_l \in \mathcal{F}$ with 0 empirical error. This
however does not guarantee that the expected risk of $\hat{f}_l$ is also close to 0, or close to $I[f_0]$.
Typically in the literature the two-sided uniform convergence in probability:

$$\lim_{l \to \infty} P\left\{ \sup_{f \in \mathcal{F}} \left| I[f] - I_{emp}[f; l] \right| > \epsilon \right\} = 0 \quad \forall \epsilon > 0 \qquad (14)$$

is considered, which clearly implies (13). In this paper we focus on the stronger two-sided case 
and note that one can get one-sided uniform convergence with some minor technical changes to 
the theory. We will not discuss the technical issues involved in the relations between consistency, 
non-trivial consistency, two-sided and one-sided uniform convergence (a discussion can be found 
in [96]), and from now on we concentrate on the two-sided uniform convergence in probability, 
which we simply refer to as uniform convergence. 

The theory of uniform convergence of ERM has been developed in [97, 98, 99, 94, 96]. It has 
also been studied in the context of empirical processes [29, 74, 30]. Here we summarize the main 
results of the theory. 



⁶ In the case that $V$ is $(y - f(\mathbf{x}))^2$, the minimizer of eq. (10) is the regression function $f_0(\mathbf{x}) = \int y P(y|\mathbf{x})\, dy$.

⁷ It is important to notice that the data terms (4), (5) and (6) are used for the empirical risks $I_{emp}$.
2.1 Uniform Convergence and the Vapnik-Chervonenkis bound

Vapnik and Chervonenkis [97, 98] studied under what conditions uniform convergence of the 
empirical risk to expected risk takes place. The results are formulated in terms of three important 
quantities that measure the complexity of a set of functions: the VC entropy, the annealed VC 
entropy, and the growth function. We begin with the definitions of these quantities. First 
we define the minimal $\epsilon$-net of a set, which intuitively measures the "cardinality" of a set at
"resolution" $\epsilon$:

Definition 2.1 Let $A$ be a set in a metric space $\mathcal{A}$ with distance metric $d$. For a fixed $\epsilon > 0$,
the set $B \subseteq A$ is called an $\epsilon$-net of $A$ in $\mathcal{A}$, if for any point $a \in A$ there is a point $b \in B$ such
that $d(a, b) < \epsilon$. We say that the set $B$ is a minimal $\epsilon$-net of $A$ in $\mathcal{A}$, if it is finite and contains
the minimal number of elements.

Given a training set $D_l = \{(\mathbf{x}_i, y_i) \in X \times Y\}_{i=1}^{l}$, consider the set of $l$-dimensional vectors:

$$q(f) = (V(y_1, f(\mathbf{x}_1)), \ldots, V(y_l, f(\mathbf{x}_l))) \qquad (15)$$

with $f \in \mathcal{F}$, and define the number of elements of the minimal $\epsilon$-net of this set under the metric:

$$d(q(f), q(f')) = \max_{1 \le i \le l} |V(y_i, f(\mathbf{x}_i)) - V(y_i, f'(\mathbf{x}_i))|$$

to be $\mathcal{N}^{\mathcal{F}}(\epsilon; D_l)$ (which clearly depends both on $\mathcal{F}$ and on the loss function $V$). Intuitively this
quantity measures how many different functions effectively we have at "resolution" $\epsilon$, when we
only care about the values of the functions at points in $D_l$. Using this quantity we now give the
following definitions:
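To make the counting quantity concrete, the sketch below builds the loss vectors (15) for a small finite family of functions and extracts an $\epsilon$-net greedily under the max metric. A greedy net is not guaranteed to be minimal, so its size is only an upper bound on the size of the minimal $\epsilon$-net. The family $f_a(x) = a x$, the two data points and the square loss are illustrative assumptions.

```python
def max_metric(u, v):
    """d(q(f), q(f')) = max_i |u_i - v_i| between two loss vectors."""
    return max(abs(a - b) for a, b in zip(u, v))

def greedy_eps_net(vectors, eps):
    """Greedy eps-net: keep a vector only if no kept vector is closer than eps.

    Every input vector ends up within distance < eps of some kept vector,
    so the result is an eps-net, though not necessarily a minimal one.
    """
    net = []
    for v in vectors:
        if all(max_metric(v, b) >= eps for b in net):
            net.append(v)
    return net

# Hypothetical training set and function family f_a(x) = a * x, a in {0, 0.1, ..., 2}.
data = [(0.5, 1.0), (1.0, 2.0)]
slopes = [k / 10.0 for k in range(21)]
vectors = [tuple((y - a * x) ** 2 for x, y in data) for a in slopes]

for eps in (1.0, 0.1, 0.01):
    print(eps, len(greedy_eps_net(vectors, eps)))
```

As the resolution $\epsilon$ becomes finer, more functions of the family are counted as effectively different on the given data.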

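Definition 2.2 Given a set $X \times Y$ and a probability $P(\mathbf{x}, y)$ defined over it, the VC entropy of
a set of functions $V(y, f(\mathbf{x}))$, $f \in \mathcal{F}$, on a data set of size $l$ is defined as:

$$H^{\mathcal{F}}(\epsilon; l) \equiv \int_{X,Y} \ln \mathcal{N}^{\mathcal{F}}(\epsilon; D_l) \prod_{i=1}^{l} P(\mathbf{x}_i, y_i)\, d\mathbf{x}_i\, dy_i$$

Definition 2.3 Given a set $X \times Y$ and a probability $P(\mathbf{x}, y)$ defined over it, the annealed VC
entropy of a set of functions $V(y, f(\mathbf{x}))$, $f \in \mathcal{F}$, on a data set of size $l$ is defined as:

$$H_{ann}^{\mathcal{F}}(\epsilon; l) \equiv \ln \int_{X,Y} \mathcal{N}^{\mathcal{F}}(\epsilon; D_l) \prod_{i=1}^{l} P(\mathbf{x}_i, y_i)\, d\mathbf{x}_i\, dy_i$$

Definition 2.4 Given a set $X \times Y$, the growth function of a set of functions $V(y, f(\mathbf{x}))$, $f \in \mathcal{F}$,
on a data set of size $l$ is defined as:

$$G^{\mathcal{F}}(\epsilon; l) \equiv \ln \left( \sup_{D_l \in (X \times Y)^l} \mathcal{N}^{\mathcal{F}}(\epsilon; D_l) \right)$$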



Notice that all three quantities are functions of the number of data $l$ and of $\epsilon$, and that clearly:

$$H^{\mathcal{F}}(\epsilon; l) \le H_{ann}^{\mathcal{F}}(\epsilon; l) \le G^{\mathcal{F}}(\epsilon; l).$$

These definitions can easily be extended in the case of indicator functions, i.e. functions taking
binary values⁸ such as $\{-1, 1\}$, in which case the three quantities do not depend on $\epsilon$ for $\epsilon < 1$,
since the vectors (15) are all at the vertices of the hypercube $\{0, 1\}^l$.
Using these definitions we can now state three important results of statistical learning theory
[96]:

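• For a given probability distribution $P(\mathbf{x}, y)$:

1. The necessary and sufficient condition for uniform convergence is that

$$\lim_{l \to \infty} \frac{H^{\mathcal{F}}(\epsilon; l)}{l} = 0 \quad \forall \epsilon > 0$$

2. A sufficient condition for fast asymptotic rate of convergence⁹ is that

$$\lim_{l \to \infty} \frac{H_{ann}^{\mathcal{F}}(\epsilon; l)}{l} = 0 \quad \forall \epsilon > 0$$

It is an open question whether this is also a necessary condition.

• A sufficient condition for distribution independent (that is, for any $P(\mathbf{x}, y)$) fast rate of
convergence is that

$$\lim_{l \to \infty} \frac{G^{\mathcal{F}}(\epsilon; l)}{l} = 0 \quad \forall \epsilon > 0$$

For indicator functions this is also a necessary condition.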

According to statistical learning theory, these three quantities are what one should consider when 
designing and analyzing learning machines: the VC-entropy and the annealed VC-entropy for 
an analysis which depends on the probability distribution P(x, y) of the data, and the growth 
function for a distribution independent analysis. In this paper we consider only distribution 
independent results, although the reader should keep in mind that distribution dependent results 
are likely to be important in the future. 

Unfortunately the growth function of a set of functions is difficult to compute in practice. So 
the standard approach in statistical learning theory is to use an upper bound on the growth 
function which is given using another important quantity, the VC-dimension, which is another
(looser) measure of the complexity, or capacity, of a set of functions. In this paper we concentrate
on this quantity, but it is important that the reader keeps in mind that the VC-dimension is 
in a sense a "weak" measure of complexity of a set of functions, so it typically leads to loose 
upper bounds on the growth function: in general one is better off, theoretically, using directly 
the growth function. We now discuss the VC-dimension and its implications for learning. 
The VC-dimension was first defined for the case of indicator functions and then was extended to 
real valued functions. 



⁸ In the case of indicator functions, $y$ is binary, and $V$ is 0 for $f(\mathbf{x}) = y$, 1 otherwise.

⁹ This means that for any $l > l_0$ we have that $P\{\sup_{f \in \mathcal{F}} |I[f] - I_{emp}[f]| > \epsilon\} < e^{-c\epsilon^2 l}$ for some constant $c > 0$.
Intuitively, fast rate is typically needed in practice.




Definition 2.5 The VC-dimension of a set $\{\theta(f(\mathbf{x})), f \in \mathcal{F}\}$ of indicator functions is the
maximum number $h$ of vectors $\mathbf{x}_1, \ldots, \mathbf{x}_h$ that can be separated into two classes in all $2^h$ possible
ways using functions of the set.

If, for any number $N$, it is possible to find $N$ points $\mathbf{x}_1, \ldots, \mathbf{x}_N$ that can be separated in all the
$2^N$ possible ways, we will say that the VC-dimension of the set is infinite.
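Definition 2.5 can be checked by brute force on small examples. The sketch below tests whether a finite family of indicator functions separates a set of points in all possible ways, and reports the largest shattered subset of a finite pool of points. Because both the pool and the family are finite, the result is in general only a lower bound on the VC-dimension of the underlying infinite family. The threshold functions on the real line and all names are illustrative assumptions.

```python
from itertools import combinations

def is_shattered(points, functions):
    """True if the functions realize all 2^h labelings of the h points."""
    labelings = {tuple(f(x) for x in points) for f in functions}
    return len(labelings) == 2 ** len(points)

def largest_shattered_size(pool, functions):
    """Size of the largest subset of pool that the functions shatter."""
    best = 0
    for h in range(1, len(pool) + 1):
        # Subsets of a shattered set are shattered, so we can stop at the first failure.
        if not any(is_shattered(s, functions) for s in combinations(pool, h)):
            break
        best = h
    return best

pool = [0.0, 1.0, 2.0, 3.0]
thresholds = [-0.5, 0.5, 1.5, 2.5, 3.5]
# Indicators theta(x - t): label 1 to the right of the threshold.
one_sided = [lambda x, t=t: int(x - t > 0) for t in thresholds]
# Adding theta(t - x) allows either side of the threshold to be labeled 1.
two_sided = one_sided + [lambda x, t=t: int(t - x > 0) for t in thresholds]

print(largest_shattered_size(pool, one_sided))
print(largest_shattered_size(pool, two_sided))
```

One-sided thresholds cannot label a left point 1 and a right point 0, so they shatter only single points, whereas allowing both orientations makes it possible to shatter pairs but not triples.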

The remarkable property of this quantity is that, although as we mentioned the VC-dimension 
only provides an upper bound to the growth function, in the case of indicator functions, finiteness 
of the VC-dimension is a necessary and sufficient condition for uniform convergence (eq. (14)) 
independent of the underlying distribution P(x,y). 

Definition 2.6 Let $A \le V(y, f(\mathbf{x})) \le B$, $f \in \mathcal{F}$, with $A$ and $B < \infty$. The VC-dimension
of the set $\{V(y, f(\mathbf{x})), f \in \mathcal{F}\}$ is defined as the VC-dimension of the set of indicator functions
$\{\theta(V(y, f(\mathbf{x})) - \alpha), \alpha \in (A, B)\}$.

Sometimes we refer to the VC-dimension of $\{V(y, f(\mathbf{x})), f \in \mathcal{F}\}$ as the VC dimension of $V$ in $\mathcal{F}$.
It can be easily shown that for $y \in \{-1, +1\}$ and for $V(y, f(\mathbf{x})) = \theta(-y f(\mathbf{x}))$ as the loss function,
the VC dimension of $V$ in $\mathcal{F}$ computed using definition 2.6 is equal to the VC dimension of the
set of indicator functions $\{\theta(f(\mathbf{x})), f \in \mathcal{F}\}$ computed using definition 2.5. In the case of real
valued functions, finiteness of the VC-dimension is only sufficient for uniform convergence. Later
in this section we will discuss a measure of capacity that provides also necessary conditions.
An important outcome of the work of Vapnik and Chervonenkis is that the uniform deviation 
between empirical risk and expected risk in a hypothesis space can be bounded in terms of the 
VC-dimension, as shown in the following theorem: 

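Theorem 2.1 (Vapnik and Chervonenkis 1971) Let $A \le V(y, f(\mathbf{x})) \le B$, $f \in \mathcal{F}$, $\mathcal{F}$ be a set
of bounded functions and $h$ the VC-dimension of $V$ in $\mathcal{F}$. Then, with probability at least $1 - \eta$,
the following inequality holds simultaneously for all the elements $f$ of $\mathcal{F}$:

$$I_{emp}[f; l] - (B - A) \sqrt{\frac{h \ln\frac{2el}{h} - \ln\frac{\eta}{4}}{l}} \le I[f] \le I_{emp}[f; l] + (B - A) \sqrt{\frac{h \ln\frac{2el}{h} - \ln\frac{\eta}{4}}{l}} \qquad (16)$$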



The quantity $|I[f] - I_{emp}[f; l]|$ is often called estimation error, and bounds of the type above are
usually called VC bounds¹⁰. From eq. (16) it is easy to see that with probability at least $1 - \eta$:

$$I[\hat{f}_l] - 2(B - A) \sqrt{\frac{h \ln\frac{2el}{h} - \ln\frac{\eta}{4}}{l}} \le I[f_0] \le I[\hat{f}_l] + 2(B - A) \sqrt{\frac{h \ln\frac{2el}{h} - \ln\frac{\eta}{4}}{l}} \qquad (17)$$

where $\hat{f}_l$ is, as in (12), the minimizer of the empirical risk in $\mathcal{F}$.

A very interesting feature of inequalities (16) and (17) is that they are non-asymptotic, meaning
that they hold for any finite number of data points $l$, and that the error bounds do not necessarily
depend on the dimensionality of the variable $\mathbf{x}$.
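The size of the confidence term in a VC bound is easy to evaluate numerically. The sketch below assumes the standard form of the Vapnik-Chervonenkis bound, in which the deviation between empirical and expected risk is at most $(B - A)\sqrt{(h \ln(2el/h) - \ln(\eta/4))/l}$ with probability at least $1 - \eta$. The function name and the example values of $h$, $l$ and $\eta$ are hypothetical.

```python
import math

def vc_confidence(h, l, eta, a=0.0, b=1.0):
    """Width of the VC bound for VC-dimension h, l examples, confidence 1 - eta.

    Assumes a loss bounded in [a, b] and h < l, where the bound is informative.
    """
    return (b - a) * math.sqrt(
        (h * math.log(2.0 * math.e * l / h) - math.log(eta / 4.0)) / l
    )

# The term shrinks as the number of examples grows and grows with the capacity h.
for l in (100, 1000, 10000, 100000):
    print(l, round(vc_confidence(h=10, l=l, eta=0.05), 4))
```

Note that the dimensionality of $\mathbf{x}$ does not appear anywhere in the computation: only the capacity $h$, the sample size $l$ and the confidence level enter.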

Observe that theorem (2.1) and inequality (17) are meaningful in practice only if the VC-dimension
of the loss function $V$ in $\mathcal{F}$ is finite and less than $l$. Since the space $\mathcal{F}$ where the
loss function $V$ is defined is usually very large (e.g. all functions in $L_2$), one typically considers
smaller hypothesis spaces $\mathcal{H}$. The cost associated with restricting the space is called the
approximation error (see below). In the literature, the space $\mathcal{F}$ where $V$ is defined is called the target
space, while $\mathcal{H}$ is called the hypothesis space. Of course, all the definitions and analysis
above still hold for $\mathcal{H}$, where we replace $f_0$ with the minimizer of the expected risk in $\mathcal{H}$, $\hat{f}_l$ is
now the minimizer of the empirical risk in $\mathcal{H}$, and $h$ the VC-dimension of the loss function $V$
in $\mathcal{H}$. Inequalities (16) and (17) suggest a method for achieving good generalization: minimize
not only the empirical risk, but a combination of the empirical risk and the
complexity of the hypothesis space. This observation leads us to the method of Structural Risk
Minimization that we describe next.

¹⁰ It is important to note that bounds on the expected risk using the annealed VC-entropy also exist. These
are tighter than the VC-dimension ones.

2.2 The method of Structural Risk Minimization 

The idea of SRM is to define a nested sequence of hypothesis spaces $H_1 \subset H_2 \subset \ldots \subset H_{n(l)}$ with
$n(l)$ a non-decreasing integer function of $l$, where each hypothesis space $H_i$ has VC-dimension
finite and larger than that of all previous sets, i.e. if $h_i$ is the VC-dimension of space $H_i$, then
$h_1 \le h_2 \le \ldots \le h_{n(l)}$. For example $H_i$ could be the set of polynomials of degree $i$, or a set of
splines with $i$ nodes, or some more complicated nonlinear parameterization. For each element
$H_i$ of the structure the solution of the learning problem is:

$$\hat{f}_{i,l} = \arg\min_{f \in H_i} I_{emp}[f; l] \qquad (18)$$

Because of the way we define our structure it should be clear that the larger $i$ is the smaller
the empirical error of $\hat{f}_{i,l}$ is (since we have greater "flexibility" to fit our training data), but the
larger the VC-dimension part (second term) of the right hand side of (16) is. Using such a nested
sequence of more and more complex hypothesis spaces, the SRM learning technique consists of
choosing the space $H_{n^*(l)}$ for which the right hand side of inequality (16) is minimized. It can
be shown [94] that for the chosen solution $\hat{f}_{n^*(l),l}$ inequalities (16) and (17) hold with probability
at least $(1 - \eta)^{n(l)} \approx 1 - n(l)\eta$¹¹, where we replace $h$ with $h_{n^*(l)}$, $f_0$ with the minimizer of the
expected risk in $H_{n^*(l)}$, namely $f_{n^*(l)}$, and $\hat{f}_l$ with $\hat{f}_{n^*(l),l}$.
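The SRM selection rule can be sketched on the polynomial example mentioned above. For each degree $i$ the code fits a polynomial by least squares (ERM with the square loss in $H_i$), adds a VC confidence term, and picks the degree with the smallest sum. Several simplifications are assumptions made only for illustration: the capacity of $H_i$ is taken to be $h_i = i + 1$ (the number of free parameters), the loss is treated as if it were bounded with $B - A = 1$, and the confidence term uses the standard form $(B - A)\sqrt{(h \ln(2el/h) - \ln(\eta/4))/l}$. The data and function names are hypothetical.

```python
import math
import numpy as np

def vc_confidence(h, l, eta, width):
    """Confidence term of the VC bound for a loss of range width = B - A."""
    return width * math.sqrt(
        (h * math.log(2.0 * math.e * l / h) - math.log(eta / 4.0)) / l
    )

def srm_select(x, y, max_degree, eta=0.05, width=1.0):
    """Return (best, results), each entry being (bound, empirical_risk, degree)."""
    l = len(x)
    results = []
    for degree in range(max_degree + 1):
        coeffs = np.polyfit(x, y, degree)           # ERM in H_degree, square loss
        emp = float(np.mean((np.polyval(coeffs, x) - y) ** 2))
        h = degree + 1                              # ASSUMPTION: capacity proxy
        results.append((emp + vc_confidence(h, l, eta, width), emp, degree))
    return min(results), results

# Hypothetical data: a noisy quadratic.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 200)
y = x**2 - 0.5 + 0.1 * rng.standard_normal(200)

best, results = srm_select(x, y, max_degree=6)
for bound, emp, degree in results:
    print(degree, round(emp, 4), round(bound, 4))
print("selected degree:", best[2])
```

The printed table shows the trade-off described in the text: the empirical risk can only decrease as the degree grows, while the confidence term increases.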

With an appropriate choice of $n(l)$¹² it can be shown that as $l \to \infty$ and $n(l) \to \infty$, the expected
risk of the solution of the method approaches in probability the minimum of the expected risk
in $\mathcal{H} = \bigcup_{i=1}^{\infty} H_i$, namely $I[f_{\mathcal{H}}]$. Moreover, if the target function $f_0$ belongs to the closure of $\mathcal{H}$,
then eq. (12) holds in probability (see for example [96]).

However, in practice $l$ is finite ("small"), so $n(l)$ is small which means that $\mathcal{H} = \bigcup_{i=1}^{n(l)} H_i$ is a
small space. Therefore $I[f_{\mathcal{H}}]$ may be much larger than the expected risk of our target function
$f_0$, since $f_0$ may not be in $\mathcal{H}$. The distance between $I[f_{\mathcal{H}}]$ and $I[f_0]$ is called the approximation
error and can be bounded using results from approximation theory. We do not discuss these
results here and refer the reader to [54, 26].

2.3 $\epsilon$-uniform convergence and the $V_\gamma$ dimension

As mentioned above finiteness of the VC-dimension is not a necessary condition for uniform
convergence in the case of real valued functions. To get a necessary condition we need a slight
extension of the VC-dimension that has been developed (among others) in [50, 2], known as
the $V_\gamma$-dimension¹³. Here we summarize the main results of that theory that we will also use
later on to design regression machines for which we will have distribution independent uniform
convergence.

¹¹ We want (16) to hold simultaneously for all spaces $H_i$, since we choose the best $\hat{f}_{i,l}$.

¹² Various cases are discussed in [27], e.g. $n(l) = l$.

Definition 2.7 Let $A \le V(y, f(\mathbf{x})) \le B$, $f \in \mathcal{F}$, with $A$ and $B < \infty$. The $V_\gamma$-dimension of
$V$ in $\mathcal{F}$ (of the set $\{V(y, f(\mathbf{x})), f \in \mathcal{F}\}$) is defined as the maximum number $h$ of vectors
$(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_h, y_h)$ that can be separated into two classes in all $2^h$ possible ways using rules:

class 1 if: $V(y_i, f(\mathbf{x}_i)) \ge s + \gamma$
class 0 if: $V(y_i, f(\mathbf{x}_i)) \le s - \gamma$

for $f \in \mathcal{F}$ and some $s \ge 0$. If, for any number $N$, it is possible to find $N$ points $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)$
that can be separated in all the $2^N$ possible ways, we will say that the $V_\gamma$-dimension of $V$ in $\mathcal{F}$ is
infinite.

Notice that for $\gamma = 0$ this definition becomes the same as definition 2.6 for VC-dimension.
Intuitively, for $\gamma > 0$ the "rule" for separating points is more restrictive than the rule in the case
$\gamma = 0$. It requires that there is a "margin" between the points: points for which $V(y, f(\mathbf{x}))$ is
between $s + \gamma$ and $s - \gamma$ are not classified. As a consequence, the $V_\gamma$ dimension is a decreasing
function of $\gamma$ and in particular is smaller than the VC-dimension.
If $V$ is an indicator function, say $\theta(-y f(\mathbf{x}))$, then for any $\gamma$ definition 2.7 reduces to that of the
VC-dimension of a set of indicator functions.
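The effect of the margin $\gamma$ can be seen on a toy example. The sketch below checks by brute force whether a finite set of loss vectors separates a set of points in all possible ways under the rules of definition 2.7. It assumes the reading of the definition in which a single level $s \ge 0$ is shared by all points and all labelings. The loss vectors and names are hypothetical.

```python
from itertools import product

def is_gamma_shattered(loss_vectors, gamma, tol=1e-12):
    """loss_vectors[j][i] holds V(y_i, f_j(x_i)) for function j and point i.

    Returns True if, for some level s >= 0, every labeling of the points is
    realized by some function with V >= s + gamma on class 1 points and
    V <= s - gamma on class 0 points.
    """
    h = len(loss_vectors[0])
    # A feasible level can always be lowered until it equals 0 or some
    # value V + gamma, so only these candidate levels need to be checked.
    candidates = {0.0} | {v + gamma for vec in loss_vectors for v in vec}
    for s in sorted(c for c in candidates if c >= 0.0):
        realized = True
        for labeling in product((0, 1), repeat=h):
            if not any(
                all(
                    (v >= s + gamma - tol) if bit else (v <= s - gamma + tol)
                    for v, bit in zip(vec, labeling)
                )
                for vec in loss_vectors
            ):
                realized = False
                break
        if realized:
            return True
    return False

# Four functions whose losses on two points take every combination of 0 and 1.
vectors = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
for gamma in (0.25, 0.5, 0.75):
    print(gamma, is_gamma_shattered(vectors, gamma))
```

In this example the two points remain separable while the band of width $2\gamma$ fits between the loss values 0 and 1, and stop being separable once $\gamma$ exceeds 0.5, in line with the statement that the $V_\gamma$ dimension decreases with $\gamma$.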

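Generalizing slightly the definition of eq. (14) we will say that for a given $\epsilon > 0$ the ERM method
converges $\epsilon$-uniformly in $\mathcal{F}$ in probability (or that there is $\epsilon$-uniform convergence) if:

$$\lim_{l \to \infty} P\left\{ \sup_{f \in \mathcal{F}} \left| I_{emp}[f; l] - I[f] \right| > \epsilon \right\} = 0. \qquad (19)$$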

Notice that if eq. (19) holds for every $\epsilon > 0$ we have uniform convergence (eq. (14)). It can be
shown (variation of [96]) that $\epsilon$-uniform convergence in probability implies that:

$$I[\hat{f}_l] \le I[f_0] + 2\epsilon \qquad (20)$$

in probability, where, as before, $\hat{f}_l$ is the minimizer of the empirical risk and $f_0$ is the minimizer
of the expected risk in $\mathcal{F}$¹⁴.

The basic theorems for the $V_\gamma$-dimension are the following:

Theorem 2.2 (Alon et al., 1993) Let $A \le V(y, f(\mathbf{x})) \le B$, $f \in \mathcal{F}$, $\mathcal{F}$ be a set of bounded
functions. For any $\epsilon > 0$, if the $V_\gamma$ dimension of $V$ in $\mathcal{F}$ is finite for $\gamma = \alpha\epsilon$ for some constant
$\alpha \ge \frac{1}{48}$, then the ERM method $\epsilon$-converges in probability.

Theorem 2.3 (Alon et al., 1993) Let $A \le V(y, f(\mathbf{x})) \le B$, $f \in \mathcal{F}$, $\mathcal{F}$ be a set of bounded
functions. The ERM method uniformly converges (in probability) if and only if the $V_\gamma$ dimension
of $V$ in $\mathcal{F}$ is finite for every $\gamma > 0$. So finiteness of the $V_\gamma$ dimension for every $\gamma > 0$ is a
necessary and sufficient condition for distribution independent uniform convergence of the ERM
method for real-valued functions.



¹³ In the literature, other quantities, such as the fat-shattering dimension and the $P_\gamma$ dimension, are also defined.
They are closely related to each other, and are essentially equivalent to the $V_\gamma$ dimension for the purpose of this
paper. The reader can refer to [2, 7] for an in-depth discussion on this topic.

¹⁴ This is like $\epsilon$-learnability in the PAC model [93].
Theorem 2.4 (Alon et al., 1993) Let $A \le V(y, f(\mathbf{x})) \le B$, $f \in \mathcal{F}$, $\mathcal{F}$ be a set of bounded
functions. For any $\epsilon > 0$, for all $l \ge \frac{2}{\epsilon^2}$ we have that if $h_\gamma$ is the $V_\gamma$ dimension of $V$ in $\mathcal{F}$ for
$\gamma = \alpha\epsilon$ ($\alpha \ge \frac{1}{48}$), $h_\gamma$ finite, then:

$$P\left\{ \sup_{f \in \mathcal{F}} \left| I_{emp}[f; l] - I[f] \right| > \epsilon \right\} \le Q(\epsilon, l, h_\gamma), \qquad (21)$$

where $Q$ is an increasing function of $h_\gamma$ and a decreasing function of $\epsilon$ and $l$, with $Q \to 0$ as
$l \to \infty$¹⁵.

From this theorem we can easily see that for any $\epsilon > 0$, for all $l \ge \frac{2}{\epsilon^2}$:

$$P\left\{ I[\hat{f}_l] \le I[f_0] + 2\epsilon \right\} \ge 1 - 2Q(\epsilon, l, h_\gamma), \qquad (22)$$

where $\hat{f}_l$ is, as before, the minimizer of the empirical risk in $\mathcal{F}$. An important observation to keep
in mind is that theorem 2.4 requires the $V_\gamma$ dimension of the loss function $V$ in $\mathcal{F}$. In the case
of classification, this implies that if we want to derive bounds on the expected misclassification
we have to use the $V_\gamma$ dimension of the loss function $\theta(-y f(\mathbf{x}))$ (which is the VC-dimension
of the set of indicator functions $\{\mathrm{sgn}(f(\mathbf{x})), f \in \mathcal{F}\}$), and not the $V_\gamma$ dimension of the set $\mathcal{F}$.
The theory of the $V_\gamma$ dimension justifies the "extended" SRM method we describe below. It is
important to keep in mind that the method we describe is only of theoretical interest and will
only be used later as a theoretical motivation for RN and SVM. It should be clear that all the
definitions and analysis above still hold for any hypothesis space $\mathcal{H}$, where we replace $f_0$ with
the minimizer of the expected risk in $\mathcal{H}$, $\hat{f}_l$ is now the minimizer of the empirical risk in $\mathcal{H}$, and
$h$ the VC-dimension of the loss function $V$ in $\mathcal{H}$.

Let $l$ be the number of training data. For a fixed $\epsilon > 0$ such that $l \ge \frac{2}{\epsilon^2}$, let $\gamma = \frac{1}{48}\epsilon$, and
consider, as before, a nested sequence of hypothesis spaces $H_1 \subset H_2 \subset \ldots \subset H_{n(l,\epsilon)}$, where each
hypothesis space $H_i$ has $V_\gamma$-dimension finite and larger than that of all previous sets, i.e. if $h_i$ is
the $V_\gamma$-dimension of space $H_i$, then $h_1 \le h_2 \le \ldots \le h_{n(l,\epsilon)}$. For each element $H_i$ of the structure
consider the solution of the learning problem to be:

$$\hat{f}_{i,l} = \arg\min_{f \in H_i} I_{emp}[f; l]. \qquad (23)$$

Because of the way we define our structure the larger $i$ is the smaller the empirical error of $\hat{f}_{i,l}$ is
(since we have more "flexibility" to fit our training data), but the larger the right hand side of
inequality (21) is. Using such a nested sequence of more and more complex hypothesis spaces,
this extended SRM learning technique consists of finding the structure element $H_{n^*(l,\epsilon)}$ for which
the trade off between empirical error and the right hand side of (21) is optimal. One practical
idea is to find numerically for each $H_i$ the "effective" $\epsilon_i$ so that the bound (21) is the same for
all $H_i$, and then choose $\hat{f}_{i,l}$ for which the sum of the empirical risk and $\epsilon_i$ is minimized.
We conjecture that as $l \to \infty$, for appropriate choice of $n(l, \epsilon)$ with $n(l, \epsilon) \to \infty$ as $l \to \infty$, the
expected risk of the solution of the method converges in probability to a value less than $2\epsilon$ away
from the minimum expected risk in $\mathcal{H} = \bigcup_{i=1}^{\infty} H_i$. Notice that we described an SRM method for
a fixed $\epsilon$. If the $V_\gamma$ dimension of $H_i$ is finite for every $\gamma > 0$, we can further modify the extended
SRM method so that $\epsilon \to 0$ as $l \to \infty$. We conjecture that if the target function $f_0$ belongs to the
closure of $\mathcal{H}$, then as $l \to \infty$, with appropriate choices of $\epsilon$, $n(l, \epsilon)$ and $n^*(l, \epsilon)$ the solution of this
SRM method can be proven (as before) to satisfy eq. (12) in probability. Finding appropriate
forms of $\epsilon$, $n(l, \epsilon)$ and $n^*(l, \epsilon)$ is an open theoretical problem (which we believe to be a technical
matter). Again, as in the case of "standard" SRM, in practice $l$ is finite so $\mathcal{H} = \bigcup_{i=1}^{n(l,\epsilon)} H_i$ is
a small space and the solution of this method may have expected risk much larger than the
expected risk of the target function. Approximation theory can be used to bound this difference
[61].

¹⁵ Closed forms of $Q$ can be derived (see for example [2]) but we do not present them here for simplicity of
notation.

The proposed method is difficult to implement in practice since it is difficult to decide the 
optimal trade off between empirical error and the bound (21). If we had constructive bounds on 
the deviation between the empirical and the expected risk like that of theorem 2.1 then we could 
have a practical way of choosing the optimal element of the structure. Unfortunately existing 
bounds of that type [2, 7] are not tight. So the final choice of the element of the structure may 
be done in practice using other techniques such as cross-validation [102]. 

2.4 Overview of our approach 

In order to set the stage for the next two sections on regularization and Support Vector Machines, 
we outline here how we can justify the proper use of the RN and the SVM functionals (see (3)) 
in the framework of the SRM principles just described. 

The basic idea is to define a structure in terms of a nested sequence of hypothesis spaces
$H_1 \subset H_2 \subset \ldots \subset H_{n(l)}$ with $H_m$ being the set of functions $f$ in the RKHS with:

$$\|f\|_K \le A_m, \qquad (24)$$

where $A_m$ is a monotonically increasing sequence of positive constants. Following the SRM
method outlined above, for each $m$ we will minimize the empirical risk

$$\frac{1}{l} \sum_{i=1}^{l} V(y_i, f(x_i)),$$

subject to the constraint (24). This in turn leads to using the Lagrange multiplier $\lambda_m$ and to
minimizing

$$\frac{1}{l} \sum_{i=1}^{l} V(y_i, f(x_i)) + \lambda_m \left( \|f\|_K^2 - A_m^2 \right),$$

with respect to $f$ and maximizing with respect to $\lambda_m \ge 0$ for each element of the structure. We
can then choose the optimal $n^*(l)$ and the associated $\lambda^*(l)$, and get the optimal solution $\hat{f}_{n^*(l)}$.
The solution we get using this method is clearly the same as the solution of:

$$\frac{1}{l} \sum_{i=1}^{l} V(y_i, f(x_i)) + \lambda^*(l) \|f\|_K^2 \qquad (25)$$

where $\lambda^*(l)$ is the optimal Lagrange multiplier corresponding to the optimal element of the
structure $A_{n^*(l)}$. Notice that this approach is quite general. In particular it can be applied to
classical $L_2$ regularization, to SVM regression, and, as we will see, to SVM classification with
the appropriate $V(\cdot, \cdot)$.

In section 6 we will describe in detail this approach for the case that the elements of the structure
are infinite dimensional RKHS. We have outlined this theoretical method here so that the reader
understands our motivation for reviewing in the next two sections the approximation schemes
resulting from the minimization of functionals of the form of equation (25) for three specific
choices of the loss function $V$:

• $V(y, f(x)) = (y - f(x))^2$ for regularization.

• $V(y, f(x)) = |y - f(x)|_\epsilon$ for SVM regression.

• $V(y, f(x)) = |1 - y f(x)|_+$ for SVM classification.

For SVM classification the loss functions:

• $V(y, f(x)) = \theta(1 - y f(x))$ (hard margin loss function), and

• $V(y, f(x)) = \theta(-y f(x))$ (misclassification loss function)

will also be discussed. First we present an overview of RKHS which are the hypothesis spaces 
we consider in the paper. 
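As a reading aid, the five loss functions listed above can be written out directly. The following is only a sketch: the function names are ours, and the value of the step function $\theta$ at zero is an assumed convention that the text above does not fix.

```python
def square_loss(y, fx):
    """(y - f(x))^2, the loss used for regularization."""
    return (y - fx) ** 2


def eps_insensitive_loss(y, fx, eps):
    """|y - f(x)|_eps: zero inside the eps-tube, linear outside (SVM regression)."""
    return max(0.0, abs(y - fx) - eps)


def soft_margin_loss(y, fx):
    """|1 - y f(x)|_+ for labels y in {-1, +1} (SVM classification)."""
    return max(0.0, 1.0 - y * fx)


def theta(t):
    """Step function; theta(0) = 0 is an assumption made for this sketch."""
    return 1.0 if t > 0 else 0.0


def hard_margin_loss(y, fx):
    """theta(1 - y f(x)): counts every point that violates the margin."""
    return theta(1.0 - y * fx)


def misclassification_loss(y, fx):
    """theta(-y f(x)): counts only points on the wrong side of zero."""
    return theta(-y * fx)
```

For a correctly classified point inside the margin, such as y = 1 and f(x) = 0.3, the misclassification loss is zero while the hard margin loss is one and the soft margin loss is 0.7.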

3 Reproducing Kernel Hilbert Spaces: a brief overview 

A Reproducing Kernel Hilbert Space (RKHS) [5] is a Hilbert space $\mathcal{H}$ of functions defined over
some bounded domain $X \subset R^d$ with the property that, for each $x \in X$, the evaluation functionals
$\mathcal{F}_x$ defined as

$$\mathcal{F}_x[f] = f(x) \quad \forall f \in \mathcal{H}$$

are linear, bounded functionals. The boundedness means that there exists a $U = U_x \in R^+$ such
that:

$$|\mathcal{F}_x[f]| = |f(x)| \le U \|f\|$$

for all $f$ in the RKHS.

It can be proved [102] that to every RKHS $\mathcal{H}$ there corresponds a unique positive definite function
$K(x, y)$ of two variables in $X$, called the reproducing kernel of $\mathcal{H}$ (hence the terminology RKHS),
that has the following reproducing property:

$$f(x) = \langle f(y), K(y, x) \rangle_{\mathcal{H}} \quad \forall f \in \mathcal{H}, \qquad (26)$$

where $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ denotes the scalar product in $\mathcal{H}$. The function $K$ behaves in $\mathcal{H}$ as the delta
function does in $L_2$, although $L_2$ is not a RKHS (the functionals $\mathcal{F}_x$ are clearly not bounded).
To make things clearer we sketch a way to construct a RKHS, which is relevant to our paper.
The mathematical details (such as the convergence or not of certain series) can be found in the
theory of integral equations [45, 20, 23].

Let us assume that we have a sequence of positive numbers $\lambda_n$ and linearly independent functions
$\phi_n(x)$ such that they define a function $K(x, y)$ in the following way (footnote 16):

$$K(x, y) \equiv \sum_{n=0}^{\infty} \lambda_n \phi_n(x) \phi_n(y), \qquad (27)$$

where the series is well defined (for example it converges uniformly). A simple calculation shows
that the function $K$ defined in eq. (27) is positive definite. Let us now take as our Hilbert space
the set of functions of the form:

$$f(x) = \sum_{n=0}^{\infty} a_n \phi_n(x) \qquad (28)$$

for any $a_n \in R$, and define the scalar product in our space to be:

$$\left\langle \sum_{n=0}^{\infty} a_n \phi_n(x), \sum_{n=0}^{\infty} d_n \phi_n(x) \right\rangle_{\mathcal{H}} \equiv \sum_{n=0}^{\infty} \frac{a_n d_n}{\lambda_n}. \qquad (29)$$

Assuming that all the evaluation functionals are bounded, it is now easy to check that such a
Hilbert space is a RKHS with reproducing kernel given by $K(x, y)$. In fact we have:

$$\langle f(y), K(y, x) \rangle_{\mathcal{H}} = \sum_{n=0}^{\infty} \frac{a_n \lambda_n \phi_n(x)}{\lambda_n} = \sum_{n=0}^{\infty} a_n \phi_n(x) = f(x), \qquad (30)$$

hence equation (26) is satisfied.

16 When working with complex functions $\phi_n(x)$ this formula should be replaced with
$K(x, y) \equiv \sum_{n=0}^{\infty} \lambda_n \phi_n(x) \phi_n^*(y)$.

Notice that when we have a finite number of $\phi_n$, the $\lambda_n$ can be arbitrary (finite) numbers, since
convergence is ensured. In particular they can all be equal to one.

Generally, it is easy to show [102] that whenever a function $K$ of the form (27) is available, it
is possible to construct a RKHS as shown above. Vice versa, for any RKHS there is a unique
kernel $K$ and corresponding $\lambda_n$, $\phi_n$, that satisfy equation (27) and for which equations (28), (29)
and (30) hold for all functions in the RKHS. Moreover, equation (29) shows that the norm of
the RKHS has the form:

$$\|f\|_K^2 = \sum_{n=0}^{\infty} \frac{a_n^2}{\lambda_n}. \qquad (31)$$

The $\phi_n$ form a basis for the RKHS (not necessarily orthonormal), and the kernel $K$ is the
"correlation" matrix associated with these basis functions. It is in fact well known that there is a
close relation between Gaussian processes and RKHS [58, 40, 72]. Wahba [102] discusses in depth
the relation between regularization, RKHS and correlation functions of Gaussian processes. The
choice of the $\phi_n$ defines a space of functions - the functions that are spanned by the $\phi_n$.
We also call the space $\{(\phi_n(x))_{n=1}^{\infty}, \; x \in X\}$ the feature space induced by the kernel $K$. The
choice of the $\phi_n$ defines the feature space where the data $x$ are "mapped". In this paper we refer
to the dimensionality of the feature space as the dimensionality of the RKHS. This is clearly
equal to the number of basis elements $\phi_n$, which does not necessarily have to be infinite. For
example, with $K$ a Gaussian, the dimensionality of the RKHS is infinite ($\phi_n(x)$ are the Fourier
components $e^{i n \cdot x}$), while when $K$ is a polynomial of degree $k$ ($K(x, y) = (1 + x \cdot y)^k$ - see section
4), the dimensionality of the RKHS is finite, and all the infinite sums above are replaced with
finite sums.
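The finite-dimensional case can be checked by hand. The sketch below takes the polynomial kernel of degree 2 with scalar inputs, $K(x, y) = (1 + xy)^2 = 1 + 2xy + x^2y^2$, which has the form (27) with $\phi_0 = 1$, $\phi_1 = x$, $\phi_2 = x^2$ and $\lambda = (1, 2, 1)$. The choice of degree, the scalar input and all function names are assumptions made for illustration only.

```python
LAMBDAS = [1.0, 2.0, 1.0]  # lambda_n for the degree-2 polynomial kernel on scalars


def features(x):
    """phi_n(x) for n = 0, 1, 2."""
    return [1.0, x, x * x]


def kernel_closed_form(x, y):
    return (1.0 + x * y) ** 2


def kernel_from_expansion(x, y):
    """Equation (27) with a finite sum."""
    return sum(l * p * q for l, p, q in zip(LAMBDAS, features(x), features(y)))


def evaluate(a, x):
    """f(x) = sum_n a_n phi_n(x), equation (28)."""
    return sum(an * pn for an, pn in zip(a, features(x)))


def inner_product(a, d):
    """Scalar product (29) between functions with coefficients a and d."""
    return sum(an * dn / ln for an, dn, ln in zip(a, d, LAMBDAS))


def kernel_section_coeffs(x):
    """Coefficients of the function K(., x) in the basis phi_n: lambda_n phi_n(x)."""
    return [l * p for l, p in zip(LAMBDAS, features(x))]
```

With these definitions, `inner_product(a, kernel_section_coeffs(x))` equals `evaluate(a, x)` for any coefficient list `a`, which is the reproducing property (26) in this three-dimensional RKHS.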

It is well known that expressions of the form (27) actually abound. In fact, it follows from
Mercer's theorem [45] that any function $K(x, y)$ which is the kernel of a positive operator (footnote 17)
in $L_2(\Omega)$ has an expansion of the form (27), in which the $\phi_i$ and the $\lambda_i$ are respectively the
orthogonal eigenfunctions and the positive eigenvalues of the operator corresponding to $K$. In
[91] it is reported that the positivity of the operator associated to $K$ is equivalent to the statement
that the kernel $K$ is positive definite, that is the matrix $K_{ij} = K(x_i, x_j)$ is positive definite for
all choices of distinct points $x_i \in X$. Notice that a kernel $K$ could have an expansion of the form
(27) in which the $\phi_n$ are not necessarily its eigenfunctions. The only requirement is that the $\phi_n$
are linearly independent but not necessarily orthogonal.

17 We remind the reader that positive definite operators in $L_2$ are self-adjoint operators such that
$\langle Kf, f \rangle \ge 0$ for all $f \in L_2$.

In the case that the space $X$ has finite cardinality, the "functions" $f$ are evaluated only at a finite
number of points $x$. If $M$ is the cardinality of $X$, then the RKHS becomes an $M$-dimensional
space where the functions $f$ are basically $M$-dimensional vectors, the kernel $K$ becomes an $M \times M$
matrix, and the condition that makes it a valid kernel is that it is a symmetric positive definite
matrix (semi-definite if $M$ is larger than the dimensionality of the RKHS). Positive definite
matrices are known to be the ones which define dot products, i.e. $f K f^T \ge 0$ for every $f$ in the
RKHS. The space consists of all $M$-dimensional vectors $f$ with finite norm $f K f^T$.
Summarizing, RKHS are Hilbert spaces where the dot product is defined using a function $K(x, y)$
which needs to be positive definite just like in the case that $X$ has finite cardinality. The elements
of the RKHS are all functions $f$ that have a finite norm given by equation (31). Notice the
equivalence of a) choosing a specific RKHS $\mathcal{H}$, b) choosing a set of $\phi_n$ and $\lambda_n$, c) choosing a
reproducing kernel $K$. The last one is the most natural for most applications. A simple example
of a RKHS is presented in Appendix B.

Finally, it is useful to notice that the solutions of the methods we discuss in this paper can be
written both in the form (2), and in the form (28). Often in the literature formulation (2) is
called the dual form of $f$, while (28) is called the primal form of $f$.

4 Regularization Networks 

In this section we consider the approximation scheme that arises from the minimization of the
quadratic functional

$$\min_{f \in \mathcal{H}} H[f] = \frac{1}{l} \sum_{i=1}^{l} (y_i - f(x_i))^2 + \lambda \|f\|_K^2 \qquad (32)$$

for a fixed $\lambda$. Formulations like equation (32) are a special form of regularization theory developed
by Tikhonov, Ivanov [92, 46] and others to solve ill-posed problems and in particular to solve
the problem of approximating the functional relation between $x$ and $y$ given a finite number of
examples $D = \{x_i, y_i\}_{i=1}^{l}$. As we mentioned in the previous sections our motivation in this paper
is to use this formulation as an approximate implementation of Vapnik's SRM principle.
In classical regularization the data term is an $L_2$ loss function for the empirical risk, whereas the
second term - called stabilizer - is usually written as a functional $\Omega(f)$ with certain properties
[92, 69, 39]. Here we consider a special class of stabilizers, that is the norm $\|f\|_K^2$ in a RKHS
induced by a symmetric, positive definite function $K(x, y)$. This choice allows us to develop a
framework of regularization which includes most of the usual regularization schemes. The only
significant omission in this treatment - that we make here for simplicity - is the restriction on
$K$ to be symmetric positive definite so that the stabilizer is a norm. However, the theory can
be extended without problems to the case in which $K$ is positive semidefinite, in which case the
stabilizer is a semi-norm [102, 56, 31, 33]. This approach was also sketched in [90].
The stabilizer in equation (32) effectively constrains $f$ to be in the RKHS defined by $K$. It is
possible to show (see for example [69, 39]) that the function that minimizes the functional (32)
has the form:

$$f(x) = \sum_{i=1}^{l} c_i K(x, x_i), \qquad (33)$$

where the coefficients $c_i$ depend on the data and satisfy the following linear system of equations:

$$(K + \lambda I) c = y \qquad (34)$$

where $I$ is the identity matrix, and we have defined

$$(y)_i = y_i, \quad (c)_i = c_i, \quad (K)_{ij} = K(x_i, x_j).$$

It is remarkable that the solution of the more general case of

$$\min_{f \in \mathcal{H}} H[f] = \frac{1}{l} \sum_{i=1}^{l} V(y_i - f(x_i)) + \lambda \|f\|_K^2, \qquad (35)$$

where the function $V$ is any differentiable function, is quite similar: the solution has exactly the
same general form of (33), though the coefficients cannot be found anymore by solving a linear
system of equations as in equation (34) [37, 40, 90]. For a proof see Appendix C.
The approximation scheme of equation (33) has a simple interpretation in terms of a network
with one layer of hidden units [71, 39]. Using different kernels we get various RN's. A short list
of examples is given in Table 1.
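A minimal numerical sketch of equations (33) and (34) follows, assuming NumPy, a Gaussian kernel and a toy data set chosen only for illustration. The code solves (34) exactly as written above, so any factor of $l$ coming from the $1/l$ in (32) is assumed to be absorbed into $\lambda$. The function names are ours.

```python
import numpy as np


def gaussian_kernel(X, Y):
    """K(x, y) = exp(-||x - y||^2) for all pairs of rows of X and Y."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0))  # clip tiny negative round-off


def fit_rn(X, y, lam):
    """Solve (K + lam I) c = y, equation (34)."""
    K = gaussian_kernel(X, X)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)


def predict_rn(X_train, c, X_new):
    """f(x) = sum_i c_i K(x, x_i), equation (33)."""
    return gaussian_kernel(X_new, X_train) @ c


# Toy one-dimensional data (illustrative only).
X = np.linspace(0.0, 3.0, 7).reshape(-1, 1)
y = np.sin(X[:, 0])
c = fit_rn(X, y, lam=1e-3)
```

On the training points the fitted values satisfy $Kc = y - \lambda c$, so the residual at each example is proportional to its coefficient $c_i$.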



Kernel Function                                              Regularization Network
-----------------------------------------------------------  ------------------------------------
$K(x - y) = \exp(-\|x - y\|^2)$                              Gaussian RBF
$K(x - y) = (\|x - y\|^2 + c^2)^{-1/2}$                      Inverse Multiquadric
$K(x - y) = (\|x - y\|^2 + c^2)^{1/2}$                       Multiquadric
$K(x - y) = \|x - y\|^{2n+1}$                                Thin plate splines
$K(x - y) = \|x - y\|^{2n} \ln(\|x - y\|)$
$K(x, y) = \tanh(x \cdot y - \theta)$                        Multi Layer Perceptron
                                                             (only for some values of $\theta$)
$K(x, y) = (1 + x \cdot y)^d$                                Polynomial of degree d
$K(x, y) = B_{2n+1}(x - y)$                                  B-splines
$K(x, y) = \frac{\sin((d + 1/2)(x - y))}{\sin((x - y)/2)}$   Trigonometric polynomial of degree d

Table 1: Some possible kernel functions. The first four are radial kernels. The multiquadric and
thin plate splines are positive semidefinite and thus require an extension of the simple RKHS
theory of this paper. The last three kernels were proposed by Vapnik [96], originally for SVM. The
last two kernels are one-dimensional: multidimensional kernels can be built by tensor products
of one-dimensional ones. The functions $B_n$ are piecewise polynomials of degree $n$, whose exact
definition can be found in



When the kernel $K$ is positive semidefinite, there is a subspace of functions $f$ which have norm
$\|f\|_K^2$ equal to zero. They form the null space of the functional $\|f\|_K^2$ and in this case the
minimizer of (32) has the form [102]:

$$f(x) = \sum_{i=1}^{l} c_i K(x, x_i) + \sum_{\alpha=1}^{k} b_\alpha \psi_\alpha(x), \qquad (36)$$

where $\{\psi_\alpha\}_{\alpha=1}^{k}$ is a basis in the null space of the stabilizer, which in most cases is a set of
polynomials, and therefore will be referred to as the "polynomial term" in equation (36). The
coefficients $b_\alpha$ and $c_i$ depend on the data. For the classical regularization case of equation (32),
the coefficients of equation (36) satisfy the following linear system:

$$(K + \lambda I) c + \Psi^T b = y, \qquad (37)$$

$$\Psi c = 0, \qquad (38)$$

where $I$ is the identity matrix, and we have defined

$$(y)_i = y_i, \quad (c)_i = c_i, \quad (b)_i = b_i, \quad (K)_{ij} = K(x_i, x_j), \quad (\Psi)_{\alpha i} = \psi_\alpha(x_i).$$

When the kernel is positive definite, as in the case of the Gaussian, the null space of the stabilizer
is empty. However, it is often convenient to redefine the kernel and the norm induced by it so that
the induced RKHS contains only zero-mean functions, that is functions $f_1(x)$ s.t. $\int_X f_1(x) dx = 0$.
In the case of a radial kernel $K$, for instance, this amounts to considering a new kernel

$$K'(x, y) = K(x, y) - \lambda_0$$

without the zeroth order Fourier component, and a norm

$$\|f\|_{K'}^2 = \sum_{n=1}^{\infty} \frac{a_n^2}{\lambda_n}. \qquad (39)$$

The null space induced by the new $K'$ is the space of constant functions. Then the minimizer of
the corresponding functional (32) has the form:

$$f(x) = \sum_{i=1}^{l} c_i K'(x, x_i) + b, \qquad (40)$$

with the coefficients satisfying equations (37) and (38), that respectively become:

$$(K' + \lambda I) c + \mathbf{1} b = (K - \lambda_0 I + \lambda I) c + \mathbf{1} b = (K + (\lambda - \lambda_0) I) c + \mathbf{1} b = y, \qquad (41)$$

$$\sum_{i=1}^{l} c_i = 0. \qquad (42)$$

Equations (40) and (42) imply that the minimizer of (32) is of the form:

$$f(x) = \sum_{i=1}^{l} c_i K'(x, x_i) + b = \sum_{i=1}^{l} c_i (K(x, x_i) - \lambda_0) + b = \sum_{i=1}^{l} c_i K(x, x_i) + b. \qquad (43)$$

Thus we can effectively use a positive definite $K$ and the constant $b$, since the only change in
equation (41) just amounts to the use of a different $\lambda$. Choosing to use a non-zero $b$ effectively
means choosing a different feature space and a different stabilizer from the usual case of equation
(32): the constant feature is not considered in the RKHS norm and therefore is not "penalized".
This choice is often quite reasonable, since in many regression and, especially, classification
problems, shifts by a constant in $f$ should not be penalized.
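The system formed by (41) and (42) can be solved as a single block linear system. The sketch below assumes NumPy, a Gaussian kernel and toy data; `lam` stands for the effective regularization parameter appearing in (41), and the function name is ours, not the paper's.

```python
import numpy as np


def fit_rn_with_bias(K, y, lam):
    """Solve (K + lam I) c + 1 b = y together with sum_i c_i = 0."""
    l = len(y)
    ones = np.ones((l, 1))
    # Block matrix [[K + lam I, 1], [1^T, 0]] acting on the stacked unknowns (c, b).
    A = np.block([[K + lam * np.eye(l), ones], [ones.T, np.zeros((1, 1))]])
    rhs = np.concatenate([y, [0.0]])
    sol = np.linalg.solve(A, rhs)
    return sol[:l], sol[l]


# Toy data and Gaussian kernel matrix (illustrative only).
x = np.linspace(-1.0, 1.0, 6)
K = np.exp(-(x[:, None] - x[None, :]) ** 2)
y = x**2 + 0.5
lam = 0.1
c, b = fit_rn_with_bias(K, y, lam)
```

Adding a constant to every $y_i$ leaves the coefficients $c_i$ unchanged and shifts only $b$, which is one way to see that the constant component is not penalized.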

In summary, the argument of this section shows that using a RN of the form (43) (for a certain 
class of kernels K) is equivalent to minimizing functionals such as (32) or (35). The choice of 
K is equivalent to the choice of a corresponding RKHS and leads to various classical learning 
techniques such as RBF networks. We discuss connections between regularization and other 
techniques in sections 4.2 and 4.3. 

Notice that in the framework we use here the kernels K are not required to be radial or even 
shift-invariant. Regularization techniques used to solve supervised learning problems [69, 39] 
were typically used with shift invariant stabilizers (tensor product and additive stabilizers are 
exceptions, see [39]). We now turn to such kernels. 

4.1 Radial Basis Functions 

Let us consider a special case of the kernel $K$ of the RKHS, which is the standard case in several
papers and books on regularization [102, 70, 39]: the case in which $K$ is shift invariant, that is
$K(x, y) = K(x - y)$ and the even more special case of a radial kernel $K(x, y) = K(\|x - y\|)$.
Section 3 implies that a radial positive definite $K$ defines a RKHS in which the "features" $\phi_n$
are Fourier components, that is

$$K(x, y) = \sum_{n=0}^{\infty} \lambda_n \phi_n(x) \phi_n(y) = \sum_{n=0}^{\infty} \lambda_n e^{i 2\pi n \cdot x} e^{-i 2\pi n \cdot y}. \qquad (44)$$

Thus any positive definite radial kernel defines a RKHS over [0, 1] with a scalar product of the
form:

$$\langle f, g \rangle_{\mathcal{H}} \equiv \sum_{n=0}^{\infty} \frac{\tilde{f}(n) \tilde{g}^*(n)}{\lambda_n}, \qquad (45)$$

where $\tilde{f}$ is the Fourier transform of $f$. The RKHS becomes simply the subspace of $L_2([0, 1]^d)$ of
the functions such that

$$\|f\|_K^2 = \sum_{n=1}^{\infty} \frac{|\tilde{f}(n)|^2}{\lambda_n} < +\infty. \qquad (46)$$

Functionals of the form (46) are known to be smoothness functionals. In fact, the rate of decrease
to zero of the Fourier transform of the kernel will control the smoothness property of the function
in the RKHS. For radial kernels the minimizer of equation (32) becomes:

$$f(x) = \sum_{i=1}^{l} c_i K(\|x - x_i\|) + b \qquad (47)$$

and the corresponding RN is a Radial Basis Function Network. Thus Radial Basis Function
networks are a special case of RN [69, 39].

In fact all translation-invariant stabilizers $K(x, x_i) = K(x - x_i)$ correspond to RKHS's where the
basis functions $\phi_n$ are Fourier eigenfunctions and only differ in the spectrum of the eigenvalues
(for a Gaussian stabilizer the spectrum is Gaussian, that is $\lambda_n = A e^{-n^2/2}$ (for $\sigma = 1$)). For
example, if $\lambda_n = 0$ for all $n > n_0$, the corresponding RKHS consists of all bandlimited functions,
that is functions with zero Fourier components at frequencies higher than $n_0$ (footnote 18). Generally $\lambda_n$ are
such that they decrease as $n$ increases, therefore restricting the class of functions to be functions
with decreasing high frequency Fourier components.

In classical regularization with translation invariant stabilizers and associated kernels, the common
experience, often reported in the literature, is that the form of the kernel does not matter
much. We conjecture that this may be because all translation invariant $K$ induce the same type
of $\phi_n$ features - the Fourier basis functions.

4.2 Regularization, generalized splines and kernel smoothers 

A number of approximation and learning techniques can be studied in the framework of regu- 
larization theory and RKHS. For instance, starting from a reproducing kernel it is easy [5] to 
construct kernels that correspond to tensor products of the original RKHS; it is also easy to 
construct the additive sum of several RKHS in terms of a reproducing kernel. 

• Tensor Product Splines: In the particular case that the kernel is of the form:

$$K(x, y) = \prod_{j=1}^{d} k(x^j, y^j)$$

where $x^j$ is the jth coordinate of vector $x$ and $k$ is a positive definite function with one-dimensional
input vectors, the solution of the regularization problem becomes:

$$f(x) = \sum_{i} c_i \prod_{j=1}^{d} k(x_i^j, x^j)$$

Therefore we can get tensor product splines by choosing kernels of the form above [5].

• Additive Splines: In the particular case that the kernel is of the form:

$$K(x, y) = \sum_{j=1}^{d} k(x^j, y^j)$$

where $x^j$ is the jth coordinate of vector $x$ and $k$ is a positive definite function with one-dimensional
input vectors, the solution of the regularization problem becomes:

$$f(x) = \sum_{i} c_i \left( \sum_{j=1}^{d} k(x_i^j, x^j) \right) = \sum_{j=1}^{d} \left( \sum_{i} c_i k(x_i^j, x^j) \right) = \sum_{j=1}^{d} f_j(x^j)$$

So in this particular case we get the class of additive approximation schemes of the form:

$$f(x) = \sum_{j=1}^{d} f_j(x^j)$$

A more extensive discussion on relations between known approximation methods and regulariza- 
tion can be found in [39]. 
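The two constructions above can be sketched in a few lines. The one-dimensional Gaussian used for $k$, the function names and the sample points are assumptions made for illustration only.

```python
import math


def k1d(s, t):
    """One-dimensional positive definite kernel k (a Gaussian, as an example)."""
    return math.exp(-((s - t) ** 2))


def tensor_kernel(x, y):
    """K(x, y) = prod_j k(x^j, y^j)."""
    return math.prod(k1d(xj, yj) for xj, yj in zip(x, y))


def additive_kernel(x, y):
    """K(x, y) = sum_j k(x^j, y^j)."""
    return sum(k1d(xj, yj) for xj, yj in zip(x, y))


def f_additive(x, centers, c):
    """f(x) = sum_i c_i K(x_i, x) with the additive kernel."""
    return sum(ci * additive_kernel(xi, x) for ci, xi in zip(c, centers))


def f_component(j, t, centers, c):
    """f_j(t) = sum_i c_i k(x_i^j, t), the j-th one-dimensional component."""
    return sum(ci * k1d(xi[j], t) for ci, xi in zip(c, centers))
```

With the additive kernel, `f_additive(x, centers, c)` equals the sum over `j` of `f_component(j, x[j], centers, c)`, and with this particular choice of `k` the tensor product kernel coincides with the multidimensional Gaussian $\exp(-\|x - y\|^2)$.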



18 The simplest $K$ is then $K(x, y) = \mathrm{sinc}(x - y)$, or kernels that are convolution with it.

4.3 Dual representation of Regularization Networks

Every RN can be written as

$$f(x) = c \cdot K(x) \qquad (48)$$

where $K(x)$ is the vector of functions such that $(K(x))_i = K(x, x_i)$. Since the coefficients $c$
satisfy the equation (34), equation (48) becomes

$$f(x) = (K + \lambda I)^{-1} y \cdot K(x).$$

We can rewrite this expression as

$$f(x) = \sum_{i=1}^{l} y_i b_i(x) = y \cdot b(x) \qquad (49)$$

in which the vector $b(x)$ of basis functions is defined as:

$$b(x) = (K + \lambda I)^{-1} K(x) \qquad (50)$$

and now depends on all the data points and on the regularization parameter $\lambda$. The representation
(49) of the solution of the approximation problem is known as the dual (footnote 19) of equation (48),
and the basis functions $b_i(x)$ are called the equivalent kernels, because of the similarity with
the kernel smoothing technique [88, 41, 43]. Notice that, while in equation (48) the difficult
part is the computation of coefficients $c_i$, the kernel function $K(x, x_i)$ being predefined, in the
dual representation (49) the difficult part is the computation of the basis function $b_i(x)$, the
coefficients of the expansion being explicitly given by the $y_i$.

As observed in [39], the dual representation of a RN shows clearly how careful one should be in
distinguishing between local vs. global approximation techniques. In fact, we expect (see [88]
for the 1-D case) that in most cases the kernels $b_i(x)$ decrease with the distance of the data
points $x_i$ from the evaluation point, so that only the neighboring data affect the estimate of the
function at $x$, providing therefore a "local" approximation scheme. Even if the original kernel
$K$ is not "local", like the absolute value $|x|$ in the one-dimensional case or the multiquadric
$K(x) = \sqrt{1 + \|x\|^2}$, the basis functions $b_i(x)$ are bell shaped, local functions, whose locality will
depend on the choice of the kernel $K$, on the density of data points, and on the regularization
parameter $\lambda$. This shows that apparently "global" approximation schemes can be regarded as
local, memory-based techniques (see equation 49) [59].
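The equality of the two representations (48) and (49) is easy to confirm numerically. The sketch below assumes NumPy, a Gaussian kernel and toy data; the variable names are ours.

```python
import numpy as np


def kernel_vector(X_train, x):
    """K(x): the vector with entries K(x, x_i), here for a Gaussian kernel."""
    return np.exp(-np.sum((X_train - x) ** 2, axis=1))


# Toy data (illustrative only).
X = np.linspace(0.0, 2.0, 5).reshape(-1, 1)
y = np.cos(X[:, 0])
lam = 0.1

K = np.array([kernel_vector(X, xi) for xi in X])
A = K + lam * np.eye(len(y))

x_eval = np.array([0.8])
Kx = kernel_vector(X, x_eval)

c = np.linalg.solve(A, y)        # coefficients from equation (34)
b_x = np.linalg.solve(A, Kx)     # equivalent kernels b(x), equation (50)

f_primal = c @ Kx                # equation (48): f(x) = c . K(x)
f_dual = y @ b_x                 # equation (49): f(x) = y . b(x)
```

The two values agree because $K + \lambda I$ is symmetric. Note that `b_x` must be recomputed for every evaluation point, whereas `c` is computed once.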

4.4 From regression to classification 

So far we only considered the case that the unknown function can take any real values, specifically
the case of regression. In the particular case that the unknown function takes only two values,
i.e. -1 and 1, we have the problem of binary pattern classification, i.e. the case where we are
given data that belong to one of two classes (classes -1 and 1) and we want to find a function
that separates these classes. It can be shown [28] that, if $V$ in equation (35) is $(y - f(x))^2$, and
if $K$ defines a finite dimensional RKHS, then the minimizer of the functional

$$H[f] = \frac{1}{l} \sum_{i=1}^{l} (f(x_i) - y_i)^2 + \lambda \|f\|_K^2, \qquad (51)$$

for $\lambda \to 0$ approaches asymptotically the function in the RKHS that is closest in the $L_2$ norm to
the regression function:

$$f_0(x) = P(y = 1 | x) - P(y = -1 | x) \qquad (52)$$

The optimal Bayes rule classifier is given by thresholding the regression function, i.e. by
$\mathrm{sign}(f_0(x))$. Notice that in the case of infinite dimensional RKHS asymptotic results ensuring
consistency are available (see [27], theorem 29.8) but depend on several conditions that are
not automatically satisfied in the case we are considering. The Bayes classifier is the best classifier,
given the correct probability distribution $P$. However, approximating function (52) in the
RKHS in $L_2$ does not necessarily imply that we find the best approximation to the Bayes classifier.
For classification, only the sign of the regression function matters and not the exact value of
it. Notice that an approximation of the regression function using a mean square error criterion
places more emphasis on the most probable data points and not on the most "important" ones
which are the ones near the separating boundary.

19 Notice that this "duality" is different from the one mentioned at the end of section 3.

In the next section we will study Vapnik's more natural approach to the problem of classification 
that is based on choosing a loss function V different from the square error. This approach leads 
to solutions that emphasize data points near the separating surface. 

5 Support vector machines 

In this section we discuss the technique of Support Vector Machines (SVM) for Regression 
(SVMR) [95, 96] in terms of the SVM functional. We will characterize the form of the solu- 
tion and then show that SVM for binary pattern classification can be derived as a special case 
of the regression formulation. 

5.1 SVM in RKHS 

Once again the problem is to learn a functional relation between $x$ and $y$ given a finite number
of examples $D = \{x_i, y_i\}_{i=1}^{l}$.

The method of SVMR [96] corresponds to the following functional

$$H[f] = \frac{1}{l} \sum_{i=1}^{l} |y_i - f(x_i)|_\epsilon + \lambda \|f\|_K^2 \qquad (53)$$

which is a special case of equation (35) and where

$$V(x) = |x|_\epsilon \equiv \begin{cases} 0 & \text{if } |x| < \epsilon \\ |x| - \epsilon & \text{otherwise,} \end{cases} \qquad (54)$$

is the $\epsilon$-Insensitive Loss Function (ILF) (also noted with $L_\epsilon$). Note that the ILF assigns zero
cost to errors smaller than $\epsilon$. In other words, for the cost function $|\cdot|_\epsilon$ any function closer than $\epsilon$
to the data points is a perfect interpolant. We can think of the parameter $\epsilon$ as the resolution at
which we want to look at the data. For this reason we expect that the larger $\epsilon$ is, the simpler the
representation will be. We will come back to this point in section 8.

The minimizer of $H$ in the RKHS $\mathcal{H}$ defined by the kernel $K$ has the general form given by
equation (43), that is

$$f(x) = \sum_{i=1}^{l} c_i K(x_i, x) + b, \qquad (55)$$

where we can include the constant $b$ for the same reasons discussed in section 4.
In order to find the solution of SVM we have to minimize functional (53) (with $V$ given by
equation (54)) with respect to $f$. Since it is difficult to deal with the function $V(x) = |x|_\epsilon$, the
above problem is replaced by the following equivalent problem (by equivalent we mean that the
same function minimizes both functionals), in which an additional set of variables is introduced:



Problem 5.1

$$\min_{f, \xi, \xi^*} \Phi(f, \xi, \xi^*) = \frac{C}{l} \sum_{i=1}^{l} (\xi_i + \xi_i^*) + \frac{1}{2} \|f\|_K^2 \qquad (56)$$

subject to the constraints:

$$\begin{array}{ll} f(x_i) - y_i \le \epsilon + \xi_i & i = 1, \ldots, l \\ y_i - f(x_i) \le \epsilon + \xi_i^* & i = 1, \ldots, l \\ \xi_i, \xi_i^* \ge 0 & i = 1, \ldots, l. \end{array} \qquad (57)$$

The parameter $C$ in (56) has been introduced in order to be consistent with the standard SVM
notations [96]. Note that $\lambda$ in eq. (53) corresponds to $\frac{1}{2C}$. The equivalence is established by
noticing that in problem (5.1) a (linear) penalty is paid only when the absolute value of the
error exceeds $\epsilon$ (which corresponds to Vapnik's ILF). Notice that if either of the two top
constraints is satisfied with some non-zero $\xi_i$ (or $\xi_i^*$), the other is automatically satisfied with a
zero value for $\xi_i^*$ (or $\xi_i$).
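The equivalence can be illustrated for a single example: for a fixed value of $f(x_i)$, the smallest slack variables compatible with the constraints (57) add up to the $\epsilon$-insensitive loss. The sketch below uses function names of our own choosing.

```python
def eps_insensitive(r, eps):
    """|r|_eps as defined in equation (54)."""
    return max(0.0, abs(r) - eps)


def minimal_slacks(y, fx, eps):
    """Smallest (xi, xi_star) satisfying constraints (57) for a fixed f(x_i) = fx."""
    xi = max(0.0, fx - y - eps)        # from f(x_i) - y_i <= eps + xi_i
    xi_star = max(0.0, y - fx - eps)   # from y_i - f(x_i) <= eps + xi_i^*
    return xi, xi_star
```

For example, with y = 1.0, f(x) = 1.5 and $\epsilon$ = 0.1 only $\xi_i$ is non-zero (approximately 0.4), and it matches $|y - f(x)|_\epsilon$. For any $\epsilon > 0$ the two slacks cannot both be non-zero.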

Problem (5.1) can be solved through the technique of Lagrange multipliers. For details see [96].
The result is that the function which solves problem (5.1) can be written as:

$$f(x) = \sum_{i=1}^{l} (\alpha_i^* - \alpha_i) K(x_i, x) + b,$$

where $\alpha_i^*$ and $\alpha_i$ are the solution of the following QP-problem:

Problem 5.2

$$\min_{\alpha, \alpha^*} W(\alpha, \alpha^*) = \epsilon \sum_{i=1}^{l} (\alpha_i^* + \alpha_i) - \sum_{i=1}^{l} y_i (\alpha_i^* - \alpha_i) + \frac{1}{2} \sum_{i,j=1}^{l} (\alpha_i^* - \alpha_i)(\alpha_j^* - \alpha_j) K(x_i, x_j)$$

subject to the constraints:

$$\sum_{i=1}^{l} (\alpha_i^* - \alpha_i) = 0,$$

$$0 \le \alpha_i^*, \alpha_i \le \frac{C}{l}, \quad i = 1, \ldots, l.$$

The solutions of problems (5.1) and (5.2) are related by the Kuhn-Tucker conditions:

$$\alpha_i (f(x_i) - y_i - \epsilon - \xi_i) = 0 \quad i = 1, \ldots, l \qquad (58)$$

$$\alpha_i^* (y_i - f(x_i) - \epsilon - \xi_i^*) = 0 \quad i = 1, \ldots, l \qquad (59)$$

$$\left( \frac{C}{l} - \alpha_i^* \right) \xi_i^* = 0 \quad i = 1, \ldots, l \qquad (60)$$

$$\left( \frac{C}{l} - \alpha_i \right) \xi_i = 0 \quad i = 1, \ldots, l. \qquad (61)$$

The input data points $x_i$ for which $\alpha_i$ or $\alpha_i^*$ are different from zero are called support vectors
(SVs). Observe that $\alpha_i$ and $\alpha_i^*$ cannot be simultaneously different from zero, so that the constraint
$\alpha_i \alpha_i^* = 0$ holds true. Any of the SVs for which $0 < \alpha_j < \frac{C}{l}$ (and therefore $\xi_j = 0$) can be used to
compute the parameter $b$. In fact, in this case it follows from the Kuhn-Tucker conditions that:

$$f(x_j) = \sum_{i=1}^{l} (\alpha_i^* - \alpha_i) K(x_i, x_j) + b = y_j + \epsilon,$$

from which $b$ can be computed. The SVs are those data points $x_i$ at which the error is
greater than or equal to $\epsilon$ (footnote 20). Points at which the error is smaller than $\epsilon$ are never support vectors, and
do not enter in the determination of the solution. A consequence of this fact is that if the SVM
were run again on the new data set consisting of only the SVs the same solution would be found.
Finally observe that if we call $c_i = \alpha_i^* - \alpha_i$, we recover equation (55). With respect to the new
variable $c_i$ problem (5.2) becomes:



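Problem 5.3

$$\min_{c} E[c] = \frac{1}{2} \sum_{i,j=1}^{l} c_i c_j K(x_i, x_j) - \sum_{i=1}^{l} c_i y_i + \epsilon \sum_{i=1}^{l} |c_i|$$

subject to the constraints

$$\sum_{i=1}^{l} c_i = 0,$$

$$-\frac{C}{l} \le c_i \le \frac{C}{l}, \quad i = 1, \ldots, l.$$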



This different formulation of SVM will be useful in section 8 when we will describe the relation 
between SVM and sparse approximation techniques. 
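The change of variables $c_i = \alpha_i^* - \alpha_i$ can be checked numerically: whenever $\alpha_i \alpha_i^* = 0$ for every $i$, we have $\alpha_i^* + \alpha_i = |c_i|$ and the objective of problem (5.2) coincides with that of problem (5.3). The sketch below evaluates both objectives on small hand-picked values; the kernel matrix, the data and the function names are illustrative assumptions, and nothing is optimized here.

```python
def dual_objective(alpha, alpha_star, K, y, eps):
    """W(alpha, alpha*) of problem (5.2)."""
    l = len(y)
    d = [alpha_star[i] - alpha[i] for i in range(l)]
    linear = eps * sum(alpha_star[i] + alpha[i] for i in range(l))
    linear -= sum(y[i] * d[i] for i in range(l))
    quad = 0.5 * sum(d[i] * d[j] * K[i][j] for i in range(l) for j in range(l))
    return linear + quad


def sparse_objective(c, K, y, eps):
    """E[c] of problem (5.3)."""
    l = len(y)
    quad = 0.5 * sum(c[i] * c[j] * K[i][j] for i in range(l) for j in range(l))
    return quad - sum(c[i] * y[i] for i in range(l)) + eps * sum(abs(ci) for ci in c)


# Hand-picked values (illustrative only); the supports of alpha and alpha* are disjoint.
K = [[1.0, 0.5, 0.2], [0.5, 1.0, 0.5], [0.2, 0.5, 1.0]]
y = [0.3, -1.0, 0.8]
eps = 0.1
alpha = [0.4, 0.0, 0.0]
alpha_star = [0.0, 0.1, 0.3]
c = [a_s - a for a, a_s in zip(alpha, alpha_star)]
```

If some pair $\alpha_i$, $\alpha_i^*$ were both non-zero for the same $c_i$, the value of $W$ would exceed $E[c]$ by $2\epsilon \min(\alpha_i, \alpha_i^*)$, so such a pair cannot be optimal when $\epsilon > 0$.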

5.2 From regression to classification 

In the previous section we discussed the connection between regression and classification in the
framework of regularization. In this section, after stating the formulation of SVM for binary
pattern classification (SVMC) as developed by Cortes and Vapnik [22], we discuss a connection
between SVMC and SVMR. We will not discuss the theory of SVMC here; we refer the reader
to [96]. We point out that the SVM technique was first proposed for binary pattern classification
problems and then extended to the general regression problem [95]. Here our primary
focus is regression and we consider classification as a special case of regression.

20 In degenerate cases, however, it can happen that points whose error is equal to $\epsilon$ are not SVs.

SVMC can be formulated as the problem of minimizing:

$$H[f] = \frac{1}{l} \sum_{i=1}^{l} |1 - y_i f(x_i)|_+ + \frac{1}{2C} \|f\|_K^2, \qquad (62)$$

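which is again of the form (3). Using the fact that $y_i \in \{-1, +1\}$ it is easy to see that our formulation
(equation (62)) is equivalent to the following quadratic programming problem, originally
proposed by Cortes and Vapnik [22]:

Problem 5.4

$$\min_{f, \xi} \Phi(f, \xi) = \frac{C}{l} \sum_{i=1}^{l} \xi_i + \frac{1}{2} \|f\|_K^2$$

subject to the constraints:

$$\begin{array}{ll} y_i f(x_i) \ge 1 - \xi_i, & i = 1, \ldots, l \\ \xi_i \ge 0, & i = 1, \ldots, l. \end{array} \qquad (63)$$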

The solution of this problem is again of the form:

$$f(x) = \sum_{i=1}^{l} \alpha_i y_i K(x_i, x) + b, \qquad (64)$$

where it turns out that $0 \le \alpha_i \le \frac{C}{l}$. The input data points $x_i$ for which $\alpha_i$ is different from
zero are called, as in the case of regression, support vectors (SVs). It is often possible to write
the solution $f(x)$ as a linear combination of SVs in a number of different ways (for example in
the case that the feature space induced by the kernel $K$ has dimensionality lower than the number of
SVs). The SVs that appear in all these linear combinations are called essential support vectors.
Roughly speaking the motivation for problem (5.4) is to minimize the empirical error measured
by $\sum_{i=1}^{l} \xi_i$ (footnote 21) while controlling capacity measured in terms of the norm of $f$ in the RKHS. In fact,
the norm of $f$ is related to the notion of margin, an important idea for SVMC for which we refer
the reader to [96, 15].

We now address the following question: what happens if we apply the SVMR formulation given
by problem (5.1) to the binary pattern classification case, i.e. the case where $y_i$ take values
$\{-1, 1\}$, treating classification as a regression on binary data?

Notice that in problem (5.1) each example has to satisfy two inequalities (which come out of
using the ILF), while in problem (5.4) each example has to satisfy one inequality. It is possible
to show that for a given constant $C$ in problem (5.4), there exist $C$ and $\epsilon$ in problem (5.1) such
that the solutions of the two problems are the same, up to a constant factor. This is summarized
in the following theorem:



21 As we mentioned in section 2, for binary pattern classification the empirical error is defined as a sum of binary
numbers which in problem (5.4) would correspond to $\sum_{i=1}^{l}\theta(\xi_i)$. However in such a case the minimization
problem becomes computationally intractable. This is why in practice in the cost functional $\Phi(f,\xi)$ we approximate
$\theta(\xi_i)$ with $\xi_i$. We discuss this further in section 6.
Theorem 5.1 Suppose the classification problem (5.4) is solved with parameter $C$, and the optimal
solution is found to be $f$. Then, there exists a value $a \in (0, 1)$ such that for $\forall \epsilon \in [a, 1)$, if the
regression problem (5.1) is solved with parameter $(1 - \epsilon)C$, the optimal solution will be $(1 - \epsilon)f$.



We refer to [76] for the proof. A sketch of the proof is given in Appendix D. A direct implication 
of this result is that one can solve any SVMC problem through the SVMR formulation. A formal 
proof of this result can also be given in the framework of SRM as discussed in Appendix D. It is 
an open question what theoretical implications theorem 5.1 may have about SVMC and SVMR. 
In particular in section 6 we will discuss some recent theoretical results on SVMC that have not 
yet been extended to SVMR. It is possible that theorem 5.1 may help to extend them to SVMR. 

6 SRM for RNs and SVMs 

At the end of section 2 we outlined how one should implement both RN and SVM according
to SRM. To use the standard SRM method we first need to know the VC-dimension of the
hypothesis spaces we use. In sections 4 and 5 we saw that both RN and SVM use as hypothesis
spaces sets of bounded functions $f$ in a RKHS with $\|f\|_K$ bounded (i.e. $\|f\|_K \leq A$), where $K$ is
the kernel of the RKHS. Thus, in order to use the standard SRM method outlined in section 2
we need to know the VC dimension of such spaces under the loss functions of RN and SVM.
Unfortunately it can be shown that when the loss function $V$ is $(y - f(\mathbf{x}))^2$ ($L_2$) and also when
it is $|y_i - f(\mathbf{x}_i)|_\epsilon$ ($L_\epsilon$), the VC-dimension of $V(y, f(\mathbf{x}))$ with $f$ in $H_A = \{f : \|f\|_K \leq A\}$ does
not depend on $A$, and is infinite if the RKHS is infinite dimensional. More precisely we have
the following theorem (for a proof see for example [103, 36], or for an outline of the proof see
Appendix E):

Theorem 6.1 Let $N$ be the dimensionality of a RKHS $\mathcal{R}$. For both the $L_2$ and the $\epsilon$-insensitive
loss function $V$, the VC-dimension of $V$ in the space $H_A = \{f \in \mathcal{R} : \|f\|_K \leq A\}$ is $O(N)$,
independently of $A$. Moreover, if $N$ is infinite, the VC-dimension is infinite for any positive $A$.

It is thus impossible to use SRM with hypothesis spaces of this kind: in the case of a finite
dimensional RKHS, the RKHS norm of $f$ cannot be used to define a structure of spaces with
different VC-dimensions, and in the (typical) case that the dimensionality of the RKHS is infinite,
it is not even possible to use bound (16). So the VC-dimension cannot be used directly for
either RN or SVMR.

On the other hand, we can still use the $V_\gamma$ dimension and the extended SRM method outlined
in section 2. Again we need to know the $V_\gamma$ dimension of our loss function $V$ in the space $H_A$
defined above. In the typical case that the input space $X$ is bounded, the $V_\gamma$ dimension depends
on $A$ and is not infinite in the case of infinite dimensional RKHS. More precisely the following
theorem holds (for a proof see [36]):

Theorem 6.2 Let $N$ be the dimensionality of a RKHS $\mathcal{R}$ with kernel $K$. Assume our input space
$X$ is bounded and let $R$ be the radius of the smallest ball $B$ containing the data $\mathbf{x}$ in the feature
space induced by kernel $K$. The $V_\gamma$ dimension $h$ for regression using $L_1$ or $L_\epsilon$ loss functions for
hypothesis spaces $H_A = \{f \in \mathcal{R} \mid \|f\|_K \leq A\}$ and $y$ bounded, is finite for $\forall \gamma > 0$, with
$h \leq O\left(\min\left(N, \frac{(R^2+1)(A^2+1)}{\gamma^2}\right)\right)$.
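
To see how the quantities in Theorem 6.2 interact, the following sketch evaluates the expression inside the $O(\cdot)$, assuming the bound reads $\min(N, (R^2+1)(A^2+1)/\gamma^2)$. The function name is hypothetical and the constants hidden by the $O(\cdot)$ notation are ignored, so the numbers indicate scaling only.

```python
import math


def v_gamma_bound(N, R, A, gamma):
    """Order-of-magnitude term min(N, (R^2 + 1)(A^2 + 1) / gamma^2).

    N may be math.inf for an infinite dimensional RKHS. Constants hidden
    by the O(.) notation are not included (illustrative sketch only).
    """
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    return min(N, (R ** 2 + 1.0) * (A ** 2 + 1.0) / gamma ** 2)


# Infinite dimensional RKHS: the term stays finite and grows as gamma shrinks.
coarse = v_gamma_bound(math.inf, R=1.0, A=2.0, gamma=0.5)
fine = v_gamma_bound(math.inf, R=1.0, A=2.0, gamma=0.25)
```

For fixed $\gamma$ and $R$, the only remaining knob is $A$, the bound on the RKHS norm.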
Notice that for fixed $\gamma$ and fixed radius of the data the only variable that controls the $V_\gamma$ dimension
is the upper bound on the RKHS norm of the functions, namely $A$. Moreover, the $V_\gamma$ dimension
is finite for $\forall \gamma > 0$; therefore, according to theorem (2.3), ERM uniformly converges in $H_A$ for
any $A < \infty$, both for RN and for SVMR. Thus both RNs and SVMR are consistent in $H_A$ for
any $A < \infty$. Theoretically, we can use the extended SRM method with a sequence of hypothesis
spaces $H_A$, each defined for a different $A$. To repeat, for a fixed $\gamma > 0$ (we can let $\gamma$ go to 0 as
$l \to \infty$) we first define a structure $H_1 \subset H_2 \subset \ldots \subset H_{n(l)}$ where $H_m$ is the set of bounded
functions $f$ in a RKHS with $\|f\|_K \leq A_m$, $A_m < \infty$, and the numbers $A_m$ form an increasing
sequence. Then we minimize the empirical risk in each $H_m$ by solving the problem:

$$\text{minimize} \quad \frac{1}{l}\sum_{i=1}^{l} V(y_i, f(\mathbf{x}_i))$$
$$\text{subject to:} \quad \|f\|_K^2 \leq A_m^2 \quad (65)$$

To solve this minimization problem we minimize 

$$\frac{1}{l}\sum_{i=1}^{l} V(y_i, f(\mathbf{x}_i)) + \lambda_m\left(\|f\|_K^2 - A_m^2\right) \quad (66)$$

with respect to $f$ and maximize with respect to the Lagrange multiplier $\lambda_m$. If $f_m$ is the solution
of this problem, at the end we choose the optimal $f_{n^*(l)}$ in $F_{n^*(l)}$ with the associated $\lambda_{n^*(l)}$, where
optimality is decided based on a trade off between empirical error and the bound (21) for the
fixed $\gamma$ (which, as we mentioned, can approach zero). In the case of RN, $V$ is the $L_2$ loss function,
whereas in the case of SVMR it is the $\epsilon$-insensitive loss function.

In practice it is difficult to implement the extended SRM for two main reasons. First, as we
discussed in section 2, SRM using the $V_\gamma$ dimension is practically difficult because we do not
have tight bounds to use in order to pick the optimal $F_{n^*(l)}$ (combining theorems 6.2 and 2.4,
bounds on the expected risk of RN and SVMR machines of the form (65) can be derived, but
these bounds are not practically useful). Second, even if we could make a choice of $F_{n^*(l)}$, it is
computationally difficult to implement SRM since (65) is a constrained minimization problem
with non-linear constraints, and solving such a problem for a number of spaces $H_m$ can be
computationally difficult. So implementing SRM using the $V_\gamma$ dimension of nested subspaces of
a RKHS is practically a very difficult problem.

On the other hand, if we had the optimal Lagrange multiplier $\lambda_{n^*(l)}$, we could simply solve the
unconstrained minimization problem:

$$\frac{1}{l}\sum_{i=1}^{l} V(y_i, f(\mathbf{x}_i)) + \lambda_{n^*(l)}\|f\|_K^2 \quad (67)$$



both for RN and for SVMR. This is exactly the problem we solve in practice, as we described in
sections 4 and 5. Since the value $\lambda_{n^*(l)}$ is not known in practice, we can only "implement" the
extended SRM approximately by minimizing (67) with various values of $\lambda$ and then picking the
best $\lambda$ using techniques such as cross-validation [1, 100, 101, 49], Generalized Cross Validation,
Finite Prediction Error and the MDL criteria (see [96] for a review and comparison).
Summarizing, both the RN and the SVMR methods discussed in sections 4 and 5 can be seen
as approximations of the extended SRM method using the $V_\gamma$ dimension, with nested hypothesis
spaces being of the form $H_A = \{f \in \mathcal{R} : \|f\|_K \leq A\}$, $\mathcal{R}$ being a RKHS defined by kernel $K$.
For both RN and SVMR the $V_\gamma$ dimension of the loss function $V$ in $H_A$ is finite for $\forall \gamma > 0$, so
the ERM method uniformly converges in $H_A$ for any $A < \infty$, and we can use the extended SRM
method outlined in section 2.
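
The approximate procedure described above (minimize the unconstrained functional for several values of $\lambda$ and keep the best one) can be sketched for the RN case. The sketch assumes a one-dimensional input, a Gaussian kernel, the standard RN solution $f(x) = \sum_i c_i K(x, x_i)$ with $(K + \lambda l I)c = y$, and a simple hold-out set in place of cross-validation; all function names and the toy data are hypothetical.

```python
import math


def gaussian_kernel(a, b, width=1.0):
    return math.exp(-((a - b) ** 2) / (2.0 * width ** 2))


def solve(M, rhs):
    """Solve M z = rhs by Gaussian elimination with partial pivoting.

    Assumes M is non-singular (true here because lam > 0).
    """
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(M, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][k] * z[k] for k in range(i + 1, n))
        z[i] = (aug[i][n] - tail) / aug[i][i]
    return z


def fit_rn(xs, ys, lam):
    """Coefficients c of f(x) = sum_i c_i K(x, x_i) from (K + lam*l*I) c = y."""
    l = len(xs)
    K = [[gaussian_kernel(a, b) for b in xs] for a in xs]
    for i in range(l):
        K[i][i] += lam * l
    return solve(K, list(ys))


def predict(xs, c, x):
    return sum(ci * gaussian_kernel(x, xi) for ci, xi in zip(c, xs))


def rkhs_norm_sq(xs, c):
    """||f||_K^2 = sum_ij c_i c_j K(x_i, x_j) for a kernel expansion."""
    return sum(ci * cj * gaussian_kernel(xi, xj)
               for ci, xi in zip(c, xs) for cj, xj in zip(c, xs))


def select_lambda(train, valid, lambdas):
    """Return (best_lambda, results); results[lam] = (hold-out MSE, ||f||_K^2)."""
    xs, ys = zip(*train)
    results = {}
    for lam in lambdas:
        c = fit_rn(xs, ys, lam)
        mse = sum((y - predict(xs, c, x)) ** 2 for x, y in valid) / len(valid)
        results[lam] = (mse, rkhs_norm_sq(xs, c))
    best = min(results, key=lambda lam: results[lam][0])
    return best, results


# Toy, noise-free data (hypothetical): samples of a sinusoid.
train = [(0.5 * i, math.sin(0.5 * i)) for i in range(12)]
valid = [(0.5 * i + 0.25, math.sin(0.5 * i + 0.25)) for i in range(11)]
lambdas = [1e-3, 1e-1, 10.0]
best, results = select_lambda(train, valid, lambdas)
```

In this sketch a larger $\lambda$ always yields a solution with smaller $\|f\|_K^2$, which is the sense in which sweeping $\lambda$ stands in for sweeping the bound $A$ of the nested spaces $H_A$.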

6.1 SRM for SVM Classification 

It is interesting to notice that the same analysis can be used for the problem of classification. In 
this case the following theorem holds [35]: 

Theorem 6.3 Let $N$ be the dimensionality of a RKHS $\mathcal{R}$ with kernel $K$. Assume our input space
$X$ is bounded and let $R$ be the radius of the sphere where our data $\mathbf{x}$ belong to, in the feature
space induced by kernel $K$. The $V_\gamma$ dimension of the soft margin loss function $|1 - y f(\mathbf{x})|_+$
in $H_A = \{f \in \mathcal{R} : \|f\|_K \leq A\}$ is $\leq O\left(\min\left(N, \frac{R^2 A^2}{\gamma^2}\right)\right)$. In the case that $N$ is infinite the $V_\gamma$
dimension becomes $\leq O\left(\frac{R^2 A^2}{\gamma^2}\right)$, which is finite for $\forall \gamma > 0$.



This theorem, combined with the theorems on $V_\gamma$ dimension summarized in section 2, can be
used for a distribution independent analysis of SVMC (of the form (65)) like that of SVMR and
RN. However, a direct application of theorems 6.3 and 2.4 leads to a bound on the expected
soft margin error of the SVMC solution, instead of a more interesting bound on the expected
misclassification error. We can bound the expected misclassification error as follows.
Using theorem 2.4 with the soft margin loss function we can get a bound on the expected soft
margin loss in terms of the empirical one (the $\sum_{i=1}^{l}\xi_i$ of problem 5.4) and the $V_\gamma$ dimension given
by theorem 6.3. In particular theorem 2.4 implies:

$$\Pr\left\{\sup_{f \in H_A} \left|I_{emp}[f; l] - I[f]\right| > \epsilon\right\} \leq G(\epsilon, m, h_\gamma), \quad (68)$$

where both the expected and the empirical errors are measured using the soft margin loss
function, and $h_\gamma$ is the $V_\gamma$ dimension of theorem 6.3 for $\gamma = \alpha\epsilon$ and $\alpha$ as in theorem 2.4. On the other
hand, $\theta(-y f(\mathbf{x})) \leq |1 - y f(\mathbf{x})|_+$ for $\forall (\mathbf{x}, y)$, which implies that the expected misclassification
error is at most the expected soft margin error. Inequality (68) implies that (uniformly) for all
$f \in H_A$:

$$\Pr\left\{I[f] > \epsilon + I_{emp}[f; l]\right\} \leq G(\epsilon, m, h_\gamma). \quad (69)$$
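
The step from the soft margin bound to the misclassification bound rests on the pointwise inequality $\theta(-y f(\mathbf{x})) \leq |1 - y f(\mathbf{x})|_+$. A minimal sketch of the two losses follows; the function names are hypothetical, and the convention $\theta(0) = 1$ (a score of exactly zero counts as an error) is an assumption made here.

```python
def misclassification_loss(y, fx):
    """theta(-y f(x)): 1 if the sign of f(x) disagrees with y, else 0.

    A score of exactly zero is counted as an error (assumed convention).
    """
    return 1.0 if -y * fx >= 0 else 0.0


def soft_margin_loss(y, fx):
    """|1 - y f(x)|_+ : zero only when the example is beyond the margin."""
    return max(0.0, 1.0 - y * fx)


def empirical_risk(loss, labels, scores):
    return sum(loss(y, s) for y, s in zip(labels, scores)) / len(labels)


labels = [1, 1, -1, -1]
scores = [1.5, -0.2, -0.4, 0.3]
misclass = empirical_risk(misclassification_loss, labels, scores)
soft = empirical_risk(soft_margin_loss, labels, scores)
```

Because the inequality holds for every pair $(\mathbf{x}, y)$, it carries over to empirical averages and to expectations alike.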

Notice that (69) is different from existing bounds that use the empirical hard margin
($\theta(1 - y f(\mathbf{x}))$) error [8]. It is similar in spirit to bounds in [87] where the $\sum_{i=1}^{l}\xi_i^2$ is used 22. On the
other hand, it can be shown [35] that the $V_\gamma$ dimension for loss functions of the form $|1 - y f(\mathbf{x})|_+^\sigma$
is of the form $O\left(\frac{R^2 A^2}{\gamma^{2/\sigma}}\right)$ for $\forall\, 0 < \sigma < 1$. Thus, using the same approach outlined above for the soft
margin, we can get bounds on the misclassification error of SVMC in terms of $\sum_{i=1}^{l}(\xi_i)^\sigma$, which,
for $\sigma$ near 0, is close to the margin error used in [8] (for more information we refer the reader
to [35]). It is important to point out that bounds like (69) hold only for the machines of the
form (65), and not for the machines of the form (3) typically used in practice [35]. This is unlike
the bound in [8] which holds for machines of the form (65) and is derived using the theoretical



22 The $\sum_{i=1}^{l}\xi_i$ can be very different from the hard margin (or the misclassification) error. This may lead to
various pathological situations (see for example [80]).
results of [6] where a type of "continuous" SRM (for example for a structure of hypothesis spaces
defined through the continuous parameter $A$ of (65)) is studied 23.

In the case of classification the difficulty is the minimization of the empirical misclassification 
error. Notice that SVMC does not minimize the misclassification error, and instead minimizes 
the empirical error using the soft margin loss function. One can use the SRM method with 
the soft margin loss function (6), in which case minimizing the empirical risk is possible. The 
SRM method with the soft margin loss function would be consistent, but the misclassification 
error of the solution may not be minimal. It is unclear whether SVMC is consistent in terms of 
misclassification error. In fact the $V_\gamma$ dimension of the misclassification loss function (which is
the same as the VC-dimension - see section 2) is known to be equal to the dimensionality of the 
RKHS plus one [96]. This implies that, as discussed at the beginning of this section, it cannot 
be used to study the expected misclassification error of SVMC in terms of the empirical one. 

6.1.1 Distribution dependent bounds for SVMC 

We close this section with a brief reference to a recent distribution dependent result on the 
generalization error of SVMC. This result does not use the $V_\gamma$ or VC dimensions, which, as
we mentioned in section 2, are used only for distribution independent analysis. It also leads to 
bounds on the performance of SVMC that (unlike the distribution independent ones) can be 
useful in practice 24 . 

For a given training set of size $l$, let us define $SV_l$ to be the number of essential support vectors
of SVMC (as we defined them in section 5 - see eq. (64)). Let $R_l$ be the radius of the smallest
hypersphere in the feature space induced by kernel $K$ containing all essential SVs, $\|f\|_K(l)$ the
norm of the solution of SVMC, and $\rho(l) = \frac{1}{\|f\|_K^2(l)}$ the margin. Then for a fixed kernel and for a
fixed value of the SVMC parameter $C$ the following theorem holds:

Theorem 6.4 (Vapnik, 1998) The expected misclassification risk of the SVM trained on $l$ data
points sampled from $X \times Y$ according to a probability distribution $P(\mathbf{x}, y)$ is bounded by:

$$E\left\{\frac{\min\left(SV_{l+1}, \frac{R_{l+1}^2}{\rho(l+1)}\right)}{l+1}\right\}$$

where the expectation $E$ is taken over $P(\mathbf{x}, y)$.
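
The quantity inside the expectation of Theorem 6.4 is easy to evaluate for a single trained machine. The sketch below assumes the bound has the form $E\{\min(SV_{l+1}, R_{l+1}^2/\rho(l+1))/(l+1)\}$ with $\rho = 1/\|f\|_K^2$; the function names are hypothetical, and the inputs (number of essential SVs, radius, squared norm) are assumed to be measured on a machine trained on `n_points` $= l+1$ examples.

```python
def sv_bound_term(num_essential_sv, radius, f_norm_sq, n_points):
    """min(SV, R^2 / rho) / n_points for one training set (sketch).

    Uses rho = 1 / ||f||_K^2, so R^2 / rho = R^2 * ||f||_K^2.
    """
    radius_margin = radius ** 2 * f_norm_sq
    return min(num_essential_sv, radius_margin) / n_points


def average_bound(terms):
    """Monte Carlo stand-in for the expectation over training sets."""
    return sum(terms) / len(terms)


# Hypothetical measurements from three independently drawn training sets.
terms = [
    sv_bound_term(20, 2.0, 3.0, 100),   # radius/margin term is smaller
    sv_bound_term(8, 2.0, 3.0, 100),    # number of essential SVs is smaller
    sv_bound_term(15, 1.0, 10.0, 100),
]
estimate = average_bound(terms)
```

Whichever of the two arguments of the minimum is smaller determines the value, which is why the text below suggests controlling both rather than the norm alone.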

This theorem can also be used to justify the current formulation of SVMC, since minimizing
$\|f\|_K^2(l)$ (which is what we do in SVMC) affects the bound of theorem (6.4). It is an open question
whether the bound of (6.4) can be used to construct learning machines that are better than
current SVM. The theorem suggests that a learning machine should, instead of only minimizing
$\|f\|_K^2$, minimize $\min\left(SV_l, \frac{R_l^2}{\rho(l)}\right)$. Finally, it is an open question whether similar results exist
for the case of SVMR. As we mentioned in section 5, the connection between SVMC and SVMR
outlined in that section may suggest how to extend such results to SVMR. The problem of
finding better distribution dependent results on the generalization capabilities of SVM is a topic
of current research which may lead to better learning machines.



23 All these bounds are not tight enough in practice. 

24 Further distribution dependent results have been derived recently - see [47, 16, 34]. 
7 A Bayesian Interpretation of Regularization and SRM?

7.1 Maximum A Posteriori Interpretation of Regularization 

It is well known that a variational principle of the type of equation (1) can be derived not only in 
the context of functional analysis [92], but also in a probabilistic framework [51, 102, 100, 73, 58, 
11]. In this section we illustrate this connection for both RN and SVM, in the setting of RKHS. 
Consider the classical regularization case 

$$\min_{f \in \mathcal{H}} H[f] = \frac{1}{l}\sum_{i=1}^{l}(y_i - f(\mathbf{x}_i))^2 + \lambda\|f\|_K^2 \quad (70)$$

Following Girosi et al. [39] let us define: 

1. $D_l = \{(\mathbf{x}_i, y_i)\}$ for $i = 1, \ldots, l$ to be the set of training examples, as in the previous sections.

2. $\mathcal{P}[f|D_l]$ as the conditional probability of the function $f$ given the examples $D_l$.

3. $\mathcal{P}[D_l|f]$ as the conditional probability of $D_l$ given $f$. If the function underlying the data is
$f$, this is the probability that by random sampling the function $f$ at the sites $\{\mathbf{x}_i\}_{i=1}^{l}$ the
set of measurements $\{y_i\}_{i=1}^{l}$ is obtained. This is therefore a model of the noise.

4. $\mathcal{P}[f]$ as the a priori probability of the random field $f$. This embodies our a priori knowledge
of the function, and can be used to impose constraints on the model, assigning significant
probability only to those functions that satisfy those constraints.

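Assuming that the probability distributions $\mathcal{P}[D_l|f]$ and $\mathcal{P}[f]$ are known, the posterior
distribution $\mathcal{P}[f|D_l]$ can now be computed by applying the Bayes rule:

$$\mathcal{P}[f|D_l] \propto \mathcal{P}[D_l|f]\,\mathcal{P}[f]. \quad (71)$$

If the noise is normally distributed with variance $\sigma^2$, then the probability $\mathcal{P}[D_l|f]$ can be written
as:

$$\mathcal{P}[D_l|f] \propto e^{-\frac{1}{2\sigma^2}\sum_{i=1}^{l}(y_i - f(\mathbf{x}_i))^2}.$$

For now let us write informally the prior probability $\mathcal{P}[f]$ as

$$\mathcal{P}[f] \propto e^{-\|f\|_K^2}. \quad (72)$$

Following the Bayes rule (71) the a posteriori probability of $f$ is written as

$$\mathcal{P}[f|D_l] \propto e^{-\left[\frac{1}{2\sigma^2}\sum_{i=1}^{l}(y_i - f(\mathbf{x}_i))^2 + \|f\|_K^2\right]}. \quad (73)$$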

One of the several possible estimates [58] of the function $f$ from the probability distribution
(73) is the so called MAP (Maximum A Posteriori) estimate, that considers the function that
maximizes the a posteriori probability $\mathcal{P}[f|D_l]$, and therefore minimizes the exponent in equation
(73). The MAP estimate of $f$ is therefore the minimizer of the functional:

$$\frac{1}{l}\sum_{i=1}^{l}(y_i - f(\mathbf{x}_i))^2 + \frac{1}{l}\alpha\|f\|_K^2$$

where $\alpha$ is the a priori defined constant $2\sigma^2$, that is

$$\frac{1}{l}\sum_{i=1}^{l}(y_i - f(\mathbf{x}_i))^2 + \lambda\|f\|_K^2$$

where $\lambda = \frac{\alpha}{l}$. This functional is the same as that of equation (70), but here it is important to
notice that $\lambda(l) = \frac{\alpha}{l}$. As noticed by Girosi et al. [39], functionals of the type (72) are common
in statistical physics [67], where the stabilizer (here $\|f\|_K^2$) plays the role of an energy functional.
As we will see later, the RKHS setting we use in this paper makes clear that the correlation
function of the physical system described by $\|f\|_K^2$ is the kernel $K(\mathbf{x}, \mathbf{y})$ 25.
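
The rescaling that turns the exponent of (73) into the regularization functional can be checked numerically. The sketch below assumes Gaussian noise with variance $\sigma^2$ and the prior (72), so that multiplying the exponent by $2\sigma^2/l$ gives the functional with $\lambda = 2\sigma^2/l$; the function names and the sample values are hypothetical.

```python
def neg_log_posterior(residuals, norm_sq, sigma):
    """Exponent of (73): (1 / (2 sigma^2)) * sum_i r_i^2 + ||f||_K^2."""
    return sum(r * r for r in residuals) / (2.0 * sigma ** 2) + norm_sq


def regularization_functional(residuals, norm_sq, lam):
    """(1 / l) * sum_i r_i^2 + lam * ||f||_K^2, as in (70)."""
    l = len(residuals)
    return sum(r * r for r in residuals) / l + lam * norm_sq


def map_lambda(sigma, l):
    """Regularization parameter implied by the MAP reading: alpha / l."""
    alpha = 2.0 * sigma ** 2
    return alpha / l


residuals = [0.1, -0.2, 0.3]     # y_i - f(x_i) for some candidate f
norm_sq, sigma = 1.5, 0.5
l = len(residuals)
scale = 2.0 * sigma ** 2 / l     # positive constant, does not move the minimizer
lhs = scale * neg_log_posterior(residuals, norm_sq, sigma)
rhs = regularization_functional(residuals, norm_sq, map_lambda(sigma, l))
```

Since the two expressions differ by a positive factor only, they share the same minimizer, and the product $\lambda l$ is fixed at $\alpha = 2\sigma^2$ under this reading.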
Thus in the standard MAP interpretation of RN the data term is a model of the noise and the 
stabilizer is a prior on the regression function /. The informal argument outlined above can be 
made formally precise in the setting of this paper in which the stabilizer is a norm in a RKHS 
(see also [102]). To see the argument in more detail, let us write the prior (72) as: 

$$P[f] \propto e^{-\|f\|_K^2} = e^{-\sum_{n=1}^{M}\frac{a_n^2}{\lambda_n}}$$

where $M$ is the dimensionality of the RKHS, with possibly $M = \infty$. Of course functions $f$ can
be represented as vectors $\mathbf{a}$ in the reference system of the eigenfunctions $\phi_n$ of the kernel $K$ since

$$f(\mathbf{x}) = \sum_{n=1}^{M} a_n\phi_n(\mathbf{x}). \quad (74)$$

The stabilizer

$$\|f\|_K^2 = \sum_{n=1}^{M}\frac{a_n^2}{\lambda_n}$$

can of course be also expressed in any other reference system ($\phi' = A\phi$) as

$$\|f\|_K^2 = \mathbf{b}^T\Sigma^{-1}\mathbf{b}$$

which suggests that $\Sigma$ can be interpreted as the covariance matrix in the reference system of the
$\phi'$. It is clear in this setting that the stabilizer can be regarded as the Mahalanobis distance of
$f$ from the mean of the functions. $P[f]$ is therefore a multivariate Gaussian with zero mean in
the Hilbert space of functions defined by $K$ and spanned by the $\phi_n$:

$$P[f] \propto e^{-\|f\|_K^2} = e^{-(\mathbf{b}^T\Sigma^{-1}\mathbf{b})}$$



25 As observed in [39, 69] prior probabilities can also be seen as a measure of complexity, assigning high
complexity to the functions with small probability. This is consistent with the Minimum Description Length (MDL)
principle proposed by Rissanen [81] to measure the complexity of a hypothesis in terms of the bit length needed to
encode it. The MAP estimate mentioned above is closely related to the Minimum Description Length Principle:
the hypothesis $f$ which for given $D_l$ can be described in the most compact way is chosen as the "best" hypothesis.
Similar ideas have been explored by others (see [95, 96] for a summary).
Thus the stabilizer can be related to a Gaussian prior on the function space.
The interpretation is attractive since it seems to capture the idea that the stabilizer effectively
constrains the desired function to be in the RKHS defined by the kernel $K$. It also seems to apply
not only to classical regularization but to any functional of the form

$$H[f] = \frac{1}{l}\sum_{i=1}^{l} V(y_i - f(\mathbf{x}_i)) + \lambda\|f\|_K^2 \quad (75)$$

where $V(x)$ is any monotonically increasing loss function (see [40]). In particular it can be
applied to the SVM (regression) case in which the relevant functional is

$$\frac{1}{l}\sum_{i=1}^{l}|y_i - f(\mathbf{x}_i)|_\epsilon + \lambda\|f\|_K^2. \quad (76)$$

In both cases, one can write appropriate $P[D_l|f]$ and $P[f]$ for which the MAP estimate of

$$P[f|D_l] \propto P[D_l|f]\,P[f]$$

gives either equation (75) or equation (76). Of course, the MAP estimate is only one of several
possible. In many cases, the average of $f$, $\int f\,dP[f|D_l]$, may make more sense 26 (see [58]). This
argument provides a formal proof of the well-known equivalence between Gaussian processes
defined by the previous equation with $P[f|D_l]$ Gaussian and the RN defined by equation (70) 27.
In the following we comment separately on the stabilizer - common to RN and SVM - and on
the data term - which is different in the two cases.

7.2 Bayesian interpretation of the stabilizer in the RN and SVM 
functionals 

Assume that the problem is to estimate $f$ from sparse data $y_i$ at locations $\mathbf{x}_i$. From the previous
description it is clear that choosing a kernel $K$ is equivalent to assuming a Gaussian prior on
$f$ with covariance equal to $K$. Thus choosing a prior through $K$ is equivalent (a) to assuming a
Gaussian prior and (b) to assuming a correlation function associated with the family of functions $f$.
The relation between positive definite kernels and correlation functions $K$ of Gaussian random
processes is characterized in detail in [102], Theorem 5.2. In applications it is natural to use
an empirical estimate of the correlation function, whenever available. Notice that in the MAP
interpretation a Gaussian prior is assumed in RN as well as in SVM. For both RN and SVM
when empirical data are available on the statistics of the family of functions of the form (74) one
should check that $P[f]$ is Gaussian and make it zero-mean. Then an empirical estimate of the
correlation function $E[f(\mathbf{x})f(\mathbf{y})]$ (with the expectation relative to the distribution $P[f]$) can be
used as the kernel 28.
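
The recipe above (make the family zero-mean, then use the empirical correlation as the kernel) can be sketched for functions observed on a common finite grid. The function name and the sample values are hypothetical, and the check that $P[f]$ is Gaussian is not performed here.

```python
def empirical_kernel(samples):
    """Empirical estimate of E[f(x_i) f(x_j)] on a grid, after centering.

    samples[k][i] is the value of the k-th observed function at grid
    point x_i. Returns the matrix K[i][j] (illustrative sketch).
    """
    n_funcs = len(samples)
    n_points = len(samples[0])
    # Pointwise mean over the family, removed to make the family zero-mean.
    mean = [sum(s[i] for s in samples) / n_funcs for i in range(n_points)]
    centered = [[s[i] - mean[i] for i in range(n_points)] for s in samples]
    return [
        [sum(c[i] * c[j] for c in centered) / n_funcs for j in range(n_points)]
        for i in range(n_points)
    ]


# Two observed functions sampled at two grid points (hypothetical values).
K_hat = empirical_kernel([[1.0, 2.0], [3.0, 6.0]])
```

By construction the estimate is symmetric and positive semidefinite on the grid, which is the minimum required for it to play the role of a kernel matrix.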

Notice also that the basis functions $\phi_n$ associated with the positive definite function $K(\mathbf{x}, \mathbf{y})$
correspond to the Principal Components associated with $K$.



26 In the Gaussian case - Regularization Networks - the MAP and the average estimates coincide. 

27 Ironically, it is only recently that the neural network community seems to have realized the equivalence of 
many so-called neural networks and Gaussian processes and the fact that they work quite well (see [55] and 
references therein). 

28 We neglect here the question about how accurate the empirical estimation is. 
7.3 Bayesian interpretation of the data term in the Regularization
and SVM functional

As we already observed the model of the noise that has to be associated with the data term of
the SVM functional is not Gaussian additive as in RN. The same is true for the specific form
of Basis Pursuit Denoising considered in section 8, given the equivalence with SVM. Data terms
of the type $V(y_i - f(\mathbf{x}_i))$ can be interpreted [40] in probabilistic terms as non-Gaussian noise
models. Recently, Pontil, Mukherjee and Girosi [75] have derived a noise model corresponding
to Vapnik's $\epsilon$-insensitive loss function. It turns out that the underlying noise model consists of
the superposition of Gaussian processes with different variances and means, that is 29:



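$$\exp(-|x|_\epsilon) = \int_{-\infty}^{+\infty} dt \int_{0}^{\infty} d\beta\, \lambda(t)\mu(\beta)\sqrt{\beta}\exp\left(-\beta(x - t)^2\right), \quad (77)$$

with:

$$\lambda(t) = \frac{1}{2(\epsilon + 1)}\left(\chi_{[-\epsilon,\epsilon]}(t) + \delta(t - \epsilon) + \delta(t + \epsilon)\right), \quad (78)$$

$$\mu(\beta) \propto \beta^{-2}\exp\left(-\frac{1}{4\beta}\right), \quad (79)$$

where $\chi_{[-\epsilon,\epsilon]}(t)$ is 1 for $t \in [-\epsilon, \epsilon]$, 0 otherwise.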

For the derivation see Appendix F or [75]. Notice that the variance has a unimodal distribution
that does not depend on $\epsilon$, and the mean has a distribution which is uniform in the interval $[-\epsilon, \epsilon]$
(except for two delta functions at $\pm\epsilon$, which ensure that the mean has nonzero probability of
being equal to $\pm\epsilon$). The distribution of the mean is consistent with the current understanding of
Vapnik's ILF: errors smaller than $\epsilon$ do not count because they may be due entirely to the bias
of the Gaussian noise.
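
The role of the distribution of the mean can be checked numerically. The sketch below assumes that the integral over the variance has already been carried out and yields a Laplacian $e^{-|x - t|}$ centered at the mean $t$ (the standard Gaussian scale-mixture representation), and that the mean is distributed as $\frac{1}{2(\epsilon+1)}(\chi_{[-\epsilon,\epsilon]}(t) + \delta(t-\epsilon) + \delta(t+\epsilon))$. Under these assumptions the mixture equals $e^{-|x|_\epsilon}$ up to the normalization $\frac{1}{\epsilon+1}$. The function names are hypothetical.

```python
import math


def eps_insensitive(x, eps):
    """|x|_eps = max(0, |x| - eps)."""
    return max(0.0, abs(x) - eps)


def mixture_of_laplacians(x, eps, n=20000):
    """Integrate exp(-|x - t|) against the assumed distribution of the mean.

    The uniform part on [-eps, eps] is integrated with the midpoint rule;
    the two delta functions at +eps and -eps contribute point evaluations.
    """
    h = 2.0 * eps / n
    uniform_part = h * sum(
        math.exp(-abs(x - (-eps + (k + 0.5) * h))) for k in range(n)
    )
    delta_part = math.exp(-abs(x - eps)) + math.exp(-abs(x + eps))
    return (uniform_part + delta_part) / (2.0 * (eps + 1.0))


eps = 0.5
points = [-3.0, -0.2, 0.0, 0.4, 1.0, 2.5]
errors = [
    abs(mixture_of_laplacians(x, eps)
        - math.exp(-eps_insensitive(x, eps)) / (eps + 1.0))
    for x in points
]
```

Inside $[-\epsilon, \epsilon]$ the mixture is flat, which matches the reading that errors smaller than $\epsilon$ carry no cost.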

7.4 Why a MAP interpretation may be misleading 

We have just seen that minimization of both the RN and the SVMR functionals can be interpreted 
as corresponding to the MAP estimate of the posterior probability of / given the data, for certain 
models of the noise and for a specific Gaussian prior on the space of functions /. However, a MAP 
interpretation of this type may in general be inconsistent with Structural Risk Minimization and 
more generally with Vapnik's analysis of the learning problem. The following argument due to 
Vapnik shows the general point. 

Consider functionals (32) and (53). From a Bayesian point of view, instead of the parameter $\lambda$ -
which in RN and SVM is a function of the data (through the SRM principle) - we have $\lambda$ which
depends on the data as $\frac{\alpha}{l}$: the constant $\alpha$ has to be independent of the training data (i.e. their
size $l$). On the other hand, as we discussed in section 2, SRM dictates a choice of $\lambda$ depending on
the training set. It seems unlikely that $\lambda$ could simply depend on $\frac{\alpha}{l}$ as the MAP interpretation
requires for consistency. Figure (7.4) gives a preliminary empirical demonstration that in the case
of SVMR the "MAP" dependence of $\lambda$ as $\frac{\alpha}{l}$ may not be correct.

Fundamentally, the core of Vapnik's analysis is that the key to learning from finite training sets 
is capacity control, that is the control of the complexity of the hypothesis space as a function of 
the training set. From this point of view the ability to choose $\lambda$ as a function of the training



29 In the following we introduce the variable $\beta = (2\sigma^2)^{-1}$.
Figure 1: An experiment (suggested by V. Vapnik) where the optimal $\lambda$ does not simply depend
on the training set as $\lambda = \frac{\alpha}{l}$ with $\alpha$ a constant and $l$ the number of data points in the training
set. In the right figure we plot $\lambda l$ as a function of the number of data. The data were generated
from a 1-d sinusoid along 3 periods, with small uniform noise added. A SVMR with Gaussian
kernel was used. We scaled the ordinate by 50 to compare with the $\log(\log(l))$ plot shown on
the left. The number of training data ranged from 10 to 500. For each $l$ we plot $\lambda l$ with $\lambda$ being
the optimal one (i.e. $\frac{1}{2C}$ for the SVMR) estimated by using the true function for validation. The
right figure shows that $\lambda l$ is not a constant as the MAP interpretation would require.



data is essential to our interpretation of Regularization and SVM in terms of the VC theory 
(compare the procedure described in our SRM section 2). Full capacity control and appropriate 
dependency of $\lambda$ on the training set, which we expect in the general case not to be simply of the
form $\frac{\alpha}{l}$, is lost in the direct MAP interpretation that we described in this chapter. Of course, an
empirical Bayesian interpretation relying on hyper-parameters in the prior is possible and often 
useful but it amounts to little more than a parametric form for the posterior distribution, usually 
used in conjunction with maximum likelihood estimation of the parameters from the data. 

8 Connections between SVMs and Sparse Approximation techniques

In recent years there has been a growing interest in approximating functions and representing 
signals using linear superposition of a small number of basis functions selected from a large, 
redundant set of basis functions, called a dictionary. These techniques go under the name of 
Sparse Approximations (SAs) [18, 17, 65, 42, 24, 57, 21, 26]. We will start with a short overview 
of SAs. Then we will discuss a result due to Girosi [38] that shows an equivalence between SVMs 
and a particular SA technique. Finally we will discuss the problem of Independent Component 
Analysis (ICA), another method for finding signal representations. 

8.1 The problem of sparsity 

Given a dictionary of basis functions (for example a frame, or just a redundant set of basis
functions) $\{\varphi_1(\mathbf{x}), \ldots, \varphi_n(\mathbf{x})\}$ with $n$ very large (possibly infinite), SA techniques seek an
approximation of a function $f(\mathbf{x})$ as a linear combination of the smallest number of elements of the
dictionary, that is, an approximation of the form:

$$f_c(\mathbf{x}) = \sum_{i=1}^{n} c_i\varphi_i(\mathbf{x}), \quad (80)$$

with the smallest number of non-zero coefficients $c_i$. Formally, the problem is formulated as
minimizing the following cost function:

$$E[c] = D\left(f(\mathbf{x}), \sum_{i=1}^{n} c_i\varphi_i(\mathbf{x})\right) + \epsilon\|c\|_{L_0}, \quad (81)$$

where $D$ is a cost measuring the distance (in some predefined norm) between the true function
$f(\mathbf{x})$ and our approximation, the $L_0$ norm of a vector counts the number of elements of that
vector which are different from zero, and $\epsilon$ is a parameter that controls the trade off between
sparsity and approximation. Observe that the larger $\epsilon$ is in (81), the more sparse the solution
will be.

In the more general case of learning, the function $f$ is not given, and instead we have a data set
$D_l = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_l, y_l)\}$ of the values $y_i$ of $f$ at locations $\mathbf{x}_i$ 30. Note that in order to minimize
$E[c]$ we need to know $f$ at all points $\mathbf{x}$. In the learning paradigm, in the particular case that
$D(f(\mathbf{x}), \sum_{i=1}^{n} c_i\varphi_i(\mathbf{x})) = \|f(\mathbf{x}) - \sum_{i=1}^{n} c_i\varphi_i(\mathbf{x})\|_{L_2}^2$, the first term in equation (81) is replaced by
an empirical one, and (81) becomes:

$$\frac{1}{l}\sum_{i=1}^{l}\left(y_i - \sum_{j=1}^{n} c_j\varphi_j(\mathbf{x}_i)\right)^2 + \epsilon\|c\|_{L_0} \quad (82)$$

Minimizing (81) can be used as well to find sparse approximations in the case that the function
$f$ is generated by a function $f_0$ corrupted by additive noise. In this case the problem can be
formulated as finding a solution $c$ to:

$$f = \Phi c + \eta \quad (83)$$

with the smallest number of non-zero elements, where $\Phi$ is the matrix with columns the elements
of the dictionary, and $\eta$ is the noise. If we take a probabilistic approach and the noise is Gaussian,
the problem can again be formulated as minimizing:

$$E[c] = \left\|f(\mathbf{x}) - \sum_{i=1}^{n} c_i\varphi_i(\mathbf{x})\right\|_{L_2}^2 + \epsilon\|c\|_{L_0}. \quad (84)$$

Unfortunately it can be shown that minimizing (81) is NP-hard because of the $L_0$ norm. In
order to circumvent this shortcoming, approximated versions of the cost function above have
been proposed. For example, in [18, 17] the authors use the $L_1$ norm as an approximation of the
$L_0$ norm, obtaining an approximation scheme that they call Basis Pursuit De-Noising (BPDN)
which consists of minimizing:

$$E[c] = \left\|f(\mathbf{x}) - \sum_{i=1}^{n} c_i\varphi_i(\mathbf{x})\right\|_{L_2}^2 + \epsilon\sum_{i=1}^{n}|c_i|. \quad (85)$$
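
The effect of the $L_1$ penalty is easiest to see in a special case that is not the general setting of this section: when the dictionary is orthonormal, the BPDN cost decouples into $\sum_i (f_i - c_i)^2 + \epsilon|c_i|$, where $f_i$ are the expansion coefficients of $f$, and each term is minimized by soft thresholding at $\epsilon/2$. The sketch below relies on that orthonormality assumption; the function names and data are hypothetical.

```python
def soft_threshold(v, t):
    """Shrink v towards zero by t, setting it to zero if |v| <= t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def bpdn_orthonormal(f_coeffs, eps):
    """Minimizer of sum_i (f_i - c_i)^2 + eps * sum_i |c_i|.

    Valid only under the assumption of an orthonormal dictionary, where
    f_coeffs[i] is the coefficient of f on the i-th dictionary element.
    """
    return [soft_threshold(f, eps / 2.0) for f in f_coeffs]


def bpdn_cost(f_coeffs, c, eps):
    fit = sum((f - ci) ** 2 for f, ci in zip(f_coeffs, c))
    return fit + eps * sum(abs(ci) for ci in c)


f_coeffs = [3.0, -0.2, 0.5, -2.0]
c_small_eps = bpdn_orthonormal(f_coeffs, eps=1.0)
c_large_eps = bpdn_orthonormal(f_coeffs, eps=5.0)
```

Small coefficients are set exactly to zero, and a larger $\epsilon$ zeroes more of them, which is the sparsity behaviour the $L_1$ surrogate is meant to retain from the $L_0$ formulation.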



30 For simplicity we consider the case where $P(\mathbf{x})$ is the uniform distribution.
8.2 Equivalence between BPDN and SVMs

In this section we consider the particular case in which we are given a data set
$D_l = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_l, y_l)\}$, and the dictionary consists of basis functions of the form:

$$\varphi_i(\mathbf{x}) = K(\mathbf{x}, \mathbf{x}_i) \quad \forall i = 1, \ldots, l \quad (86)$$

where $K$ is the reproducing kernel of a RKHS $\mathcal{H}$, and the size $l$ of $D_l$ is equal to the size $n$ of the
dictionary. Moreover, following [38], we assume that $f(\mathbf{x})$ in eq. (81) is in the RKHS, and we
use as the cost $D$ in (81) the norm in the RKHS $\mathcal{H}$ induced by the kernel $K$, and approximate
the $L_0$ norm with $L_1$. Under these assumptions, we get the SA technique that minimizes:

$$E[c] = \frac{1}{2}\left\|f(\mathbf{x}) - \sum_{i=1}^{l} c_i\varphi_i(\mathbf{x})\right\|_K^2 + \epsilon\|c\|_{L_1} \quad (87)$$

subject to $f(\mathbf{x}_i) = y_i$.

It can be shown [38] that this technique is equivalent to SVMR in the following sense: the two
techniques give the same solution, which is obtained by solving the same quadratic programming
problem. Girosi [38] proves the equivalence between SVMR and BPDN under the assumption
that the data set $\{(\mathbf{x}_i, y_i)\}_{i=1}^{l}$ has been obtained by sampling, in absence of noise, the target
function $f$. Functional (87) differs from (85) only in the cost $D$. While Chen et al., in their
BPDN method, measure the reconstruction error with an $L_2$ criterion, Girosi measures it by the
true distance, in the $\mathcal{H}$ norm, between the target function $f$ and the approximating function $f^*$.
This measure of distance, which is common in approximation theory, is better motivated than
the $L_2$ norm because it not only enforces closeness between the target and the model, but also
between their derivatives, since $\|\cdot\|_K$ is a measure of smoothness.

Notice that from eq. (87) the cost function $E$ cannot be computed because it requires the
knowledge of $f$ (in the first term). If we had $\|\cdot\|_{L_2}$ instead of $\|\cdot\|_K$ in eq. (87), this would force
us to consider the approximation:

$$\|f(\mathbf{x}) - f^*(\mathbf{x})\|_{L_2}^2 \approx \sum_{i=1}^{l}(y_i - f^*(\mathbf{x}_i))^2 \quad (88)$$

However if we use the norm $\|\cdot\|_K$ we can use the reproducing property (26), obtaining (see [38]):

$$E[c] = \frac{1}{2}\left(\|f\|_K^2 + \sum_{i,j=1}^{l} c_i c_j K(\mathbf{x}_i; \mathbf{x}_j) - 2\sum_{i=1}^{l} c_i y_i\right) + \epsilon\|c\|_{L_1} \quad (89)$$

Observe that functional (89) is the same as the objective function of SVM of problem 5.3 up to
the constant $\frac{1}{2}\|f\|_K^2$. However, in the SVM formulation the coefficients $c_i$ satisfy two constraints,
which in the case of sparsity are trivially satisfied under further assumptions. For details see
[38]. It also follows from eq. (80) and (86) that the approximating function is of the form:

$$f^*(\mathbf{x}) = f_c(\mathbf{x}) = \sum_{i=1}^{l} c_i K(\mathbf{x}; \mathbf{x}_i). \quad (90)$$

This model is similar to the one of SVM (eq. (55)), except for the constant $b$.

This relation between SVMR and SA suggests directly that SVMs yield a sparse representation.
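
The expansion of the RKHS distance through the reproducing property can be verified numerically. The sketch below assumes a one-dimensional Gaussian kernel and a target $f$ that is itself a finite kernel expansion (so that its norm and values are computable), with noiseless samples $y_i = f(x_i)$. It compares $\frac{1}{2}\|f - \sum_i c_i K(\cdot, x_i)\|_K^2 + \epsilon\|c\|_{L_1}$ computed directly with the expanded form $\frac{1}{2}(\|f\|_K^2 + \sum_{i,j} c_i c_j K(x_i, x_j) - 2\sum_i c_i y_i) + \epsilon\|c\|_{L_1}$. All names and numbers are hypothetical.

```python
import math


def kernel(a, b):
    return math.exp(-(a - b) ** 2)


def rkhs_inner(coef_a, pts_a, coef_b, pts_b):
    """<sum_i a_i K(., p_i), sum_j b_j K(., q_j)>_K = sum_ij a_i b_j K(p_i, q_j)."""
    return sum(a * b * kernel(p, q)
               for a, p in zip(coef_a, pts_a)
               for b, q in zip(coef_b, pts_b))


def evaluate(coef, pts, x):
    return sum(a * kernel(x, p) for a, p in zip(coef, pts))


def direct_cost(f_coef, f_pts, c, xs, eps):
    # g = f - sum_i c_i K(., x_i), written as a single kernel expansion.
    g_coef = list(f_coef) + [-ci for ci in c]
    g_pts = list(f_pts) + list(xs)
    norm_sq = rkhs_inner(g_coef, g_pts, g_coef, g_pts)
    return 0.5 * norm_sq + eps * sum(abs(ci) for ci in c)


def expanded_cost(f_coef, f_pts, c, xs, eps):
    ys = [evaluate(f_coef, f_pts, x) for x in xs]  # noiseless y_i = f(x_i)
    f_norm_sq = rkhs_inner(f_coef, f_pts, f_coef, f_pts)
    quadratic = rkhs_inner(c, xs, c, xs)
    linear = sum(ci * yi for ci, yi in zip(c, ys))
    return 0.5 * (f_norm_sq + quadratic - 2.0 * linear) + eps * sum(abs(ci) for ci in c)


f_coef, f_pts = [1.0, -0.5], [0.0, 1.5]        # target f in the RKHS
xs, c, eps = [-1.0, 0.5, 2.0], [0.3, -0.7, 0.2], 0.1
lhs = direct_cost(f_coef, f_pts, c, xs, eps)
rhs = expanded_cost(f_coef, f_pts, c, xs, eps)
```

Apart from the constant $\frac{1}{2}\|f\|_K^2$, the expanded form depends on $f$ only through the samples $y_i$, which is what makes the cost computable from data.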
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8.3 Independent Component Analysis 

Independent Component Analysis (ICA) is the problem of finding unknown sources whose linear 
superposition gives a number of observed signals, under the only assumption that the sources 
are statistically independent. A particular application is Blind Source Separation (BSS) where 
one is given a signal and seeks to decompose it as a linear combination of a number of unknown 
statistically independent sources. Following the notation in [4], the problem can be formulated as
finding at any time $t$ both the $n$ ($n$ predefined) sources $\mathbf{x}(t) = (x_1(t), \ldots, x_n(t))$ and the mixing
matrix $A$ (which is assumed to be the same for every $t$) of the system of linear equations:

$$\mathbf{s}(t) = A\mathbf{x}(t) + \eta \qquad (91)$$

where $\mathbf{s}(t)$ is our observed signal at time $t$, the elements of $\mathbf{x}(t)$, namely $x_i(t)$, are generated by
statistically independent sources, and $\eta$ is additive noise.

Observe that for any $t$ the formulations of ICA and SA (see eq. (83)) are similar ($\Phi$ is $A$, $f$ is
$\mathbf{s}(t)$ and $\mathbf{c}$ is $\mathbf{x}(t)$). The difference is that in the case of SA we know the mixing matrix ("basis")
$A$ ($\Phi$) and we only solve for the sources $\mathbf{x}$ ($\mathbf{c}$) with the smallest number of non-zero elements,
while for ICA and BSS both the matrix $A$ and the sources $\mathbf{x}$ are unknown, and we assume that
the $x_i(t)$ are statistically independent, while we do not have any explicit restriction on $A$.

Various methods for ICA have been developed in recent years [3, 9, 63, 53, 65]. A review of
the methods can be found in [52]. Typically the problem is solved by assuming a probability
distribution model for the sources $x_i(t)$. A typical prior distribution is the Laplacian, namely
$P(\mathbf{x}(t)) \propto e^{-(|x_1(t)| + \cdots + |x_n(t)|)}$. Moreover, if the noise $\eta$ is Gaussian with zero mean and variance $\sigma^2$,
then, for a given $A$, the probability of $\mathbf{s}(t)$ given $A$ can be written as:

$$P(\mathbf{s}(t)|A) = P(\mathbf{s}(t)|A, \mathbf{x}(t)) \cdot P(\mathbf{x}(t)) \propto e^{-\frac{\|\mathbf{s}(t) - A\mathbf{x}(t)\|^2}{2\sigma^2}} \cdot e^{-(|x_1(t)| + \cdots + |x_n(t)|)} \qquad (92)$$

The MAP estimate of (92) gives $\mathbf{x}(t)$ as the minimizer of:

$$\|\mathbf{s}(t) - A\mathbf{x}(t)\|^2 + \epsilon \sum_{i=1}^{n} |x_i(t)| \qquad (93)$$

Observe that this functional is the same as that of BPDN (eq. (85)). Therefore, for a fixed $A$ the sources
can be found by solving a BPDN problem. In fact, iterative methods in which at every iteration
$A$ is fixed and the sources are found, and then for fixed sources $A$ is updated using a learning
rule, have been developed in [65].
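
To make the fixed-$A$ step concrete, the following is a minimal sketch of minimizing functional (93) for a given mixing matrix. It uses iterative soft-thresholding (ISTA), which is our choice for illustration and not the method of [65]; the names `solve_sources`, `soft_threshold` and `objective` are hypothetical.

```python
import numpy as np

def soft_threshold(v, tau):
    # Proximal operator of tau * ||.||_1 (shrinks every entry towards zero).
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def objective(A, s, x, eps):
    # Functional (93): squared reconstruction error plus eps times the L1 norm.
    r = s - A @ x
    return r @ r + eps * np.abs(x).sum()

def solve_sources(A, s, eps, n_iter=500):
    # ISTA for min_x ||s - A x||^2 + eps * ||x||_1 with A held fixed.
    # L is the Lipschitz constant of the gradient of the quadratic term.
    L = 2.0 * np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * A.T @ (A @ x - s)
        x = soft_threshold(x - grad / L, eps / L)
    return x

# With A equal to the identity the minimizer is a soft-thresholding of s at eps / 2.
x_id = solve_sources(np.eye(3), np.array([1.0, -0.2, 0.05]), eps=0.5)
```

In an alternating scheme of the kind described above, a call such as `solve_sources` would be interleaved with an update of $A$; that update rule is not specified here and is left out of the sketch.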

To summarize, using a Laplacian prior on the sources and following an iterative method for 
solving both for the sources and for their linear combination, ICA and BSS can be seen as 
iterative methods where at each iteration one solves a SA problem. This connection between 
ICA and sparsity has also been studied in [64]. Notice that if the prior on the sources is different, 
in particular if it is super-Gaussian, then the solution at every iteration need not be sparse. 

9 Remarks 

9.1 Regularization Networks can implement SRM 

One of the main focuses of this review is to describe and motivate the classical technique of 
regularization - minimization of functionals such as in equation (1) - within the framework of 
VC theory. In particular we have shown that classical regularization functionals can be motivated
within the statistical framework of capacity control. 

9.2 The SVM functional is a special formulation of regularization 



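Classical Regularization      $H[f] = \frac{1}{l}\sum_{i=1}^{l} (y_i - f(\mathbf{x}_i))^2 + \lambda\|f\|_K^2$

SVM Regression (SVMR)         $H[f] = \frac{1}{l}\sum_{i=1}^{l} |y_i - f(\mathbf{x}_i)|_\epsilon + \lambda\|f\|_K^2$

SVM Classification (SVMC)     $H[f] = \frac{1}{l}\sum_{i=1}^{l} |1 - y_i f(\mathbf{x}_i)|_+ + \lambda\|f\|_K^2$

Table 2: A unified framework: the minimizer of each of these three functionals always has the
same form: $f(\mathbf{x}) = \sum_{i=1}^{l} c_i K(\mathbf{x}, \mathbf{x}_i)$ or $f(\mathbf{x}) = \sum_{i=1}^{l} c_i K(\mathbf{x}, \mathbf{x}_i) + b$. Of course in classification the
decision function is $\mathrm{sign}(f(\mathbf{x}))$.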

Throughout our review it is clear that classical Regularization Networks, as well as Support
Vector Machines for regression and Support Vector Machines for classification (see Table (2)),
can be justified within the same framework, based on Vapnik's SRM principle and the notion
of $V_\gamma$ dimension. The three functionals of the table have different loss functions $V(\cdot,\cdot)$ but
the same stabilizer. Thus the minimizer has the same general form and, as a consequence, the
associated network has the same architecture. In particular, RKHS, associated kernels, and the
mapping they induce from the input space into a higher dimensional space of features $\phi_n$, are
exactly the same in SVM as in RN. The different loss functions of SVM determine however
quite different properties of the solution (see Table (2)) which is, unlike regularization, sparse
in the $c_n$. Notice that loss functions different from quadratic loss have been used before in the
context of regularization. In particular, the physical analogy of representing the data term using
nonlinear springs (classical $L_2$ regularization corresponds to linear springs) was used and studied
before (for instance see [40]). It is, however, the specific choice of the loss functions in SVMC
and SVMR that provides several of their characteristic features, such as sparsity of the solution.
Notice also that the geometric interpretation of $\|f\|_K^2$ in terms of the margin [96] is true only for
the classification case and depends on the specific loss function $V(\cdot,\cdot)$ used in SVMC.
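
As an illustration of how the three functionals of Table 2 differ only in their data term, the sketch below evaluates the three loss functions and a regularized functional for a kernel expansion without the constant $b$. The function names are hypothetical; the stabilizer is computed as $\mathbf{c}^T K \mathbf{c}$, which is the value of $\|f\|_K^2$ for $f = \sum_i c_i K(\cdot, \mathbf{x}_i)$ by the reproducing property.

```python
import numpy as np

def quadratic_loss(y, fx):
    # Classical regularization: (y - f(x))^2
    return (y - fx) ** 2

def eps_insensitive_loss(y, fx, eps):
    # SVMR: |y - f(x)|_eps, zero inside a tube of half-width eps
    return np.maximum(np.abs(y - fx) - eps, 0.0)

def hinge_loss(y, fx):
    # SVMC: |1 - y f(x)|_+ with labels y in {-1, +1}
    return np.maximum(1.0 - y * fx, 0.0)

def regularized_functional(loss_values, c, K, lam):
    # H[f] = (1/l) * sum_i V_i + lam * ||f||_K^2, with ||f||_K^2 = c^T K c
    return np.mean(loss_values) + lam * (c @ K @ c)
```

For example, `eps_insensitive_loss(1.0, 0.8, 0.5)` is zero because the residual lies inside the tube, whereas `quadratic_loss(1.0, 0.8)` is not; this flat region of the loss is what allows many coefficients to vanish in SVMR.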

9.3 SVM, sparsity and compression 

From the Kuhn-Tucker conditions of the QP problem associated with SVM one expects the
Support Vectors to be usually sparser than the data. Notice that this is not obvious from a
direct inspection of the functional $H[f]$ itself, where the regularizer is an $L_2$ norm on the function
space. Especially in the case of regression it is not immediately obvious that the $H[f]$ in SVMR
should yield a sparser solution than the $H[f]$ of classical regularization (see Table (2)). The
equivalence of SVMR with a special form of Basis Pursuit Denoising shows that the $\epsilon$-insensitive
loss function with an $L_2$ regularizer is equivalent to an $L_2$ loss function and an $L_1$ regularizer. The
latter is known to yield sparsity, though it is only an approximation of a "true" sparsity regularizer
with the $L_0$ norm. Notice that SVM - like regularization - typically uses many features $\phi_n$, but
only - unlike regularization - a sparse subset of the examples. Thus SVM is not sparse in the
primal representation (see section 3) of the classifier (or regressor) but it is sparse in the dual
representation since it tends to use a subset of the dictionary consisting of the set of $K(\mathbf{x}, \mathbf{x}_i)$.

In this context, an interesting perspective on SVM is to consider its information compression
properties. The support vectors represent in this sense the most informative data points and 
compress the information contained in the training set: for the purpose of, say, classification of 
future vectors, only the support vectors need to be stored, while all other training examples can 
be discarded. There is in fact a relation between the compression factor expressed as the ratio 
of data points to support vectors and the probability of test error. Vapnik [96], in comparing 
the empirical risk minimization principle with the Minimum Description Length principle [81], 
derives a bound on the generalization error as a function of the compression coefficient. 

9.4 Gaussian processes, regularization and SVM 

The very close relation between Gaussian processes and RN is well known [58, 102]. The connec- 
tion is also valid for SVM in regression as well as in classification, since it depends on the form 
of the stabilizer, which is the same. The functional H of classical regularization is the exponent 
of the Gaussian conditional probability distribution characterizing the Gaussian process. The 
MAP estimate applied to the probability distribution corresponds to minimization of H yielding 
Regularization Networks - of which Radial Basis Function networks are a special case. Thus RN 
are connected to Gaussian processes via the MAP estimate, which in this case coincides with 
another estimate - the posterior mean. 

9.5 Kernels and how to choose an input representation 

A key issue in every learning problem concerns the input (and output) representation. This issue 
is outside the scope of the theory outlined in this review. There are however a few interesting 
observations that can be made. As pointed out by Vapnik, the choice of the kernel $K$ is equivalent
to choosing features related to the original inputs $\mathbf{x}$ by well-behaved functions $\phi_n(\mathbf{x})$, where the
$\phi_n$ are defined by $K(\mathbf{x}, \mathbf{y}) \equiv \sum_{n=1}^{N} \lambda_n \phi_n(\mathbf{x})\phi_n(\mathbf{y})$. Assume that $K$ is given and that the input
representation is now changed through a vector function $\mathbf{h}(\mathbf{x})$ mapping the original input $\mathbf{x}$ into
the new feature vector $\mathbf{h}$. This is equivalent to using a new kernel $K'$ defined in terms of the
composite features $\phi_n(\mathbf{h}(\mathbf{x}))$ as $K'(\mathbf{x}, \mathbf{y}) \equiv \sum_{n=1}^{N} \lambda_n \phi_n(\mathbf{h}(\mathbf{x}))\phi_n(\mathbf{h}(\mathbf{y}))$. For example, in the case
of a polynomial kernel $K = (1 + \mathbf{x} \cdot \mathbf{y})^d$, a linear transformation of the input data $\mathbf{x}' = P^T\mathbf{x}$ is
equivalent to using a new kernel $K'(\mathbf{x}, \mathbf{y}) = (1 + \mathbf{x}^T P P^T \mathbf{y})^d$. Clearly in the case that the projection
is onto an orthonormal basis, so that the matrix $P$ is orthonormal, the transformation does not affect
the learning machine. On the other hand, if $P$ is a matrix whose columns form an overcomplete
or undercomplete set of basis functions, the transformation can change the learning machine. In
many cases - especially when $K$ is an expansion in an infinite series of $\phi_n$ - the most natural
description is in terms of the kernel itself. In other cases, the best strategy is to define a finite
set of features $\phi_n$ and then construct the kernel by computing $K(\mathbf{x}, \mathbf{y}) = \sum_{n=1}^{N} \lambda_n \phi_n(\mathbf{x})\phi_n(\mathbf{y})$.
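
The effect of a linear change of input representation on the polynomial kernel can be checked numerically. The sketch below is an illustration under our own assumptions (random data, degree $d = 2$, hypothetical names): an orthonormal $P$ leaves the Gram matrix unchanged, while an undercomplete $P$ changes it.

```python
import numpy as np

def poly_kernel(X, Y, d):
    # Gram matrix of K(x, y) = (1 + x . y)^d for the rows of X and Y.
    return (1.0 + X @ Y.T) ** d

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                    # five inputs in R^3, one per row

Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))   # orthonormal P
P_under = rng.normal(size=(3, 2))              # undercomplete P (projects to R^2)

K = poly_kernel(X, X, d=2)
# Each row of X @ P is (P^T x)^T, i.e. the transformed input x' = P^T x.
K_orth = poly_kernel(X @ Q, X @ Q, d=2)
K_under = poly_kernel(X @ P_under, X @ P_under, d=2)
# Direct evaluation of K'(x, y) = (1 + x^T P P^T y)^d for comparison.
K_under_direct = (1.0 + X @ P_under @ P_under.T @ X.T) ** 2
```

Here `K_orth` agrees with `K` up to rounding, and `K_under` agrees with `K_under_direct` but not with `K`.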

Synthesis of kernels from kernels 

There are several symmetric positive definite kernels and a number of ways to construct new
ones from existing kernels by operating on them with a few operations such as addition and
convolution. For instance, if $K_1$ and $K_2$ are kernels then $K_1 + K_2$ is a kernel and $K_1 K_2$ is a
kernel; $(K_1)^n$ is a kernel. Thus the kernel $\sum_{i=0}^{d} (\mathbf{x} \cdot \mathbf{y})^i$ corresponds to the features of a polynomial
of degree $d$ in the spirit of [68]; Vapnik's kernel $K(\mathbf{x}, \mathbf{y}) = (1 + \mathbf{x} \cdot \mathbf{y})^d$ is in fact equivalent and
more compact. Aronszajn [5] describes several ways to construct positive definite kernels and
thereby the associated RKHS. A completely equivalent analysis exists for correlation functions.
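
The closure properties can be checked on Gram matrices. The following sketch, with data and helper names chosen by us for illustration, builds a Gaussian and a linear Gram matrix on the same points and inspects the smallest eigenvalue of their sum, their pointwise product, and an integer power; a numerical check on one data set is of course not a proof.

```python
import numpy as np

def gaussian_gram(X, sigma=1.0):
    # Gram matrix of exp(-||x - y||^2 / (2 sigma^2)).
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def linear_gram(X):
    # Gram matrix of the linear kernel x . y.
    return X @ X.T

def min_eig(G):
    # Smallest eigenvalue of the symmetrized matrix.
    return np.linalg.eigvalsh((G + G.T) / 2.0).min()

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 3))
G1, G2 = gaussian_gram(X), linear_gram(X)

G_sum = G1 + G2        # K1 + K2
G_prod = G1 * G2       # K1 K2 (pointwise product of kernel values)
G_pow = G2 ** 3        # (K2)^3
G_poly = sum(G2 ** i for i in range(0, 3))   # sum_{i=0}^{2} (x . y)^i
```

All four matrices have smallest eigenvalue non-negative up to rounding error, as expected for Gram matrices of positive definite kernels.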

Exploiting prior information 

In practical problems the choice of the regressors is often much more important than the choice 
of the learning machine. The choice of an appropriate input representation depends of course 
on prior information about the specific regression or classification problem. A general theory on 
how to use prior information to determine the appropriate representation is likely to be very far 
away. There are however a few approaches which yield some promise. 

• Kernels and estimates of correlation functions. 

Assume that the problem is to estimate $f$ from sparse data $y_i$ at locations $\mathbf{x}_i$. As we
described in section 7, let us assume that there is prior information available in terms
of the correlation function $R(\mathbf{x}, \mathbf{y}) = E[f(\mathbf{x})f(\mathbf{y})]$ of the family of functions to which $f$
belongs. In applications, for instance, it may be possible to obtain an empirical estimate of
the correlation function. From a Bayesian point of view this prior information, together with
the assumption of a Gaussian prior, determines the choice of the kernel $K = R$ and this
automatically determines the feature representation - the $\phi_n$ - to be used in the regression
problem. Preliminary experiments indicate that this strategy may give better results than
other regression approaches [66].

• Invariances and Virtual Examples. 

In many pattern recognition problems specific invariances are known to hold a priori. Niyogi
et al. [62] showed how several invariances can be embedded in the stabilizer or, equivalently,
in virtual examples (for related work on tangent distance see [89] and [84]).

• Generative probabilistic models. 

Jaakkola and Haussler [47] consider the case in which prior information is available in terms 
of a parametric probabilistic model P(x, y) of the process generating the data. They argue 
that good features for classification are the derivatives of $\log P$ with respect to the natural
parameters of the distributions at the data points. 

9.6 Capacity control and the physical world 

An interesting question, outside the realm of mathematics, which has been asked recently is why 
large margin classifiers seem to work well in the physical world. As we saw throughout this review, 
the question is closely related to the question of why to assume smoothness in regression, that is
why to use stabilizers such as $\|f\|_K^2$, which are usually smoothness functionals. Smoothness can
be justified by observing that in many cases smoothness of input-output relations is implied
directly by the existence of physical laws with continuity and differentiability properties. In
classification, minimization of $\|f\|_K$ corresponds to maximization of the margin in the space of
the $\phi_n$; it is also equivalent to choosing the decision boundary resulting from thresholding the
smoothest $f$ in the original space, according to the smoothness criterion induced by $K$ (notice
that the decision boundary is the level crossing of $f$ and not necessarily smooth everywhere).
Conversely, we would not be able to generalize for input-output relations that are not smooth, 
that is for which "similar" inputs do not correspond to "similar" outputs (in an appropriate 
metric!). Such cases exist: for instance the mapping provided by a telephone directory between 
names and telephone numbers is usually not "smooth" and it is a safe bet that it would be
difficult to learn it from examples. In cases in which physical systems are involved, however, 
input-output relations have some degree of smoothness and can be learned. From this point of 
view large margin (in feature space) and smoothness are properties of the physical world that 
are key to allow generalization, learning and the development of theories and models. 

A Regularization Theory for Learning

Classical regularization methods proposed by Tikhonov and Arsenin [92] solve the learning problem
by restricting the space of functions to be the domain of a functional $\Omega(f)$, called the
stabilizer, which possesses the following three properties:

• The unknown function $f$ is assumed to belong to the domain $D(\Omega)$ of the functional $\Omega(f)$.

• On the domain $D(\Omega)$ the functional $\Omega(f)$ admits real nonnegative values.

• The sets:

$$M_c = \{f : \Omega(f) \le c\}$$
are compact for every real nonnegative $c$.

For example, a functional of this sort typically used is the sum of the $L_2$ norms of the first
$k$ derivatives of $f$. In this case, the sets $M_c$ defined above are Sobolev spaces. Using such a
functional means that we restrict our space of functions to be the space of smooth functions, the
functions whose derivatives are in $L_2$.

Given such a functional $\Omega(f)$, the idea of regularization is to find $f$ as the minimizer of a certain
loss functional which we take to be:

$$H[f] = \sum_{i=1}^{N} V(f(\mathbf{x}_i) - y_i) + \lambda\Omega[f], \qquad (94)$$

where $V$ is the loss function and $\lambda$ is a positive number that is usually called the regularization
parameter. The first term enforces closeness to the data, and the second enforces the solution
to be in a set $M_c$ with a small $c$, while the regularization parameter controls the tradeoff between
these two terms. For example, in the particular case that $M_c$ is a Sobolev space, the second
term of the minimized functional enforces the smoothness of $f$. The first term in equation (94)
is the empirical error, while the second term is usually called the smoothness functional since it
enforces some sort of smoothness. Various methods for choosing $\lambda$ are proposed in the literature
[1, 100, 101, 49, 96]. Under some conditions on the regularization parameter $\lambda$, it can be shown
[92] that as the number of training examples increases the minimizer of equation (94) converges
to the exact solution $f$ in the space $D(\Omega)$.

To summarize, to solve the ill-posed problem of learning from examples using classical regular- 
ization methods, we need to restrict the space where we search for the solution, and minimize a 
functional that depends on the empirical error and a cost related with a functional defined on 
the space of functions we search. 

B An example of RKHS 

Here we present a simple way to construct meaningful RKHS of functions of one variable over
$[0, 2\pi]$. In the following all the normalization factors will be set to 1 for simplicity.
Let us consider any function $K(x)$ which is continuous, symmetric, periodic, and whose Fourier
coefficients $\lambda_n$ are positive. Such a function can be expanded in a uniformly convergent Fourier
series:

$$K(x) = \sum_{n=0}^{\infty} \lambda_n \cos(nx). \qquad (95)$$

An example of such a function is

$$K(x) = 1 + \sum_{n=1}^{\infty} h^n \cos(nx) = \frac{1}{2\pi}\,\frac{1 - h^2}{1 - 2h\cos(x) + h^2}$$

where $h \in (0, 1)$.
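
Since normalization factors are set to 1 in this appendix, a numerical check needs them restored. With the factors written out, the classical Poisson-kernel identity reads $1 + 2\sum_{n=1}^{\infty} h^n\cos(nx) = \frac{1-h^2}{1-2h\cos(x)+h^2}$; the sketch below compares a truncated series with this closed form. The function names, the truncation length and the value of $h$ are our own choices.

```python
import numpy as np

def series_kernel(x, h, n_terms=200):
    # Truncated series 1 + 2 * sum_{n=1}^{n_terms} h^n cos(n x).
    n = np.arange(1, n_terms + 1)[:, None]
    return 1.0 + 2.0 * np.sum(h ** n * np.cos(n * x[None, :]), axis=0)

def closed_form_kernel(x, h):
    # (1 - h^2) / (1 - 2 h cos(x) + h^2), positive for every x when 0 < h < 1.
    return (1.0 - h ** 2) / (1.0 - 2.0 * h * np.cos(x) + h ** 2)

x = np.linspace(0.0, 2.0 * np.pi, 50)
k_series = series_kernel(x, h=0.5)
k_closed = closed_form_kernel(x, h=0.5)
```

The two arrays agree to rounding error, and the Fourier coefficients $h^n$ are positive and decreasing, as the construction requires.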

It is easy to check that, if (95) holds, then we have:

$$K(x - y) = 1 + \sum_{n=1}^{\infty} \lambda_n \sin(nx)\sin(ny) + \sum_{n=1}^{\infty} \lambda_n \cos(nx)\cos(ny) \qquad (96)$$

which is of the form (27) in which the set of orthogonal functions $\phi_n$ has the form:

$$\{\phi_i(x)\}_{i=0}^{\infty} \equiv (1, \sin(x), \cos(x), \sin(2x), \cos(2x), \ldots, \sin(nx), \cos(nx), \ldots).$$

Therefore, given any function $K$ which is continuous, periodic and symmetric we can then define
a RKHS $\mathcal{H}$ over $[0, 2\pi]$ by defining a scalar product of the form:

$$\langle f, g \rangle_{\mathcal{H}} \equiv \sum_{n=0}^{\infty} \frac{f_n^c g_n^c + f_n^s g_n^s}{\lambda_n}$$

where we use the following symbols for the Fourier coefficients of a function $f$:

$$f_n^c \equiv \langle f, \cos(nx) \rangle, \qquad f_n^s \equiv \langle f, \sin(nx) \rangle.$$

The functions in $\mathcal{H}$ are therefore functions in $L_2([0, 2\pi])$ whose Fourier coefficients satisfy the
following constraint:

$$\|f\|_{\mathcal{H}}^2 = \sum_{n=0}^{\infty} \frac{(f_n^c)^2 + (f_n^s)^2}{\lambda_n} < +\infty \qquad (97)$$

Since the sequence $\lambda_n$ is decreasing, the constraint that the norm (97) has to be finite can be
seen as a constraint on the rate of decrease to zero of the Fourier coefficients of the function $f$,
which is known to be related to the smoothness properties of $f$. Therefore, choosing different
kernels $K$ is equivalent to choosing RKHS of functions with different smoothness properties, and
the norm (97) can be used as the smoothness functional.
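
The role of the norm (97) as a smoothness measure can be illustrated with truncated coefficient sequences. In the sketch below, which rests on assumptions of ours ($\lambda_n = h^n$ with $h = 0.5$, purely cosine coefficients decaying geometrically, hypothetical function names), coefficients that decay faster than $\lambda_n$ give a small norm, while coefficients that decay too slowly give partial sums that keep growing with the truncation order.

```python
import numpy as np

def rkhs_norm_sq(fc, fs, lam):
    # Truncated version of (97): sum_n ((f_n^c)^2 + (f_n^s)^2) / lambda_n.
    return np.sum((fc ** 2 + fs ** 2) / lam)

def geometric_norm_sq(rate, h, n_max):
    # Cosine coefficients f_n^c = rate^n, no sine part, lambda_n = h^n.
    n = np.arange(0, n_max + 1)
    return rkhs_norm_sq(rate ** n, np.zeros(n_max + 1), h ** n)

h = 0.5
smooth = geometric_norm_sq(0.3, h, 20)        # rate^2 < h: fast decay
rough = geometric_norm_sq(0.6, h, 20)         # rate^2 < h, but slower decay
diverging_20 = geometric_norm_sq(0.8, h, 20)  # rate^2 > h: partial sums grow
diverging_40 = geometric_norm_sq(0.8, h, 40)
```

In this setting the function with the slower decaying coefficients has the larger norm, and for a decay rate with $\text{rate}^2 > h$ the truncated sums grow without bound, so the corresponding function does not belong to the space.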

C Regularized Solutions in RKHS 

Let us look more closely at the solution of the minimization of functional (3). This is equivalent
to assuming that the functions in $\mathcal{H}$ have a unique expansion of the form:

$$f(\mathbf{x}) = \sum_{n=1}^{\infty} c_n \phi_n(\mathbf{x})$$

and that their norm is:

$$\|f\|_K^2 = \sum_{n=1}^{\infty} \frac{c_n^2}{\lambda_n}.$$

We can think of the functional $H[f]$ as a function of the coefficients $c_n$. In order to minimize
$H[f]$ we take its derivative with respect to $c_n$ and set it equal to zero, obtaining the following:

$$-C\sum_{i=1}^{l} V'(y_i, f(\mathbf{x}_i))\,\phi_n(\mathbf{x}_i) + \frac{c_n}{\lambda_n} = 0, \qquad (98)$$

where we denote by $V'$ the partial derivative of $V$ with respect to $f$. Let us now define the following set of
unknowns:

$$a_i \equiv C V'(y_i, f(\mathbf{x}_i)).$$

Using eq. (98) we can express the coefficients $c_n$ as a function of the $a_i$:

$$c_n = \lambda_n \sum_{i=1}^{l} a_i \phi_n(\mathbf{x}_i).$$

The solution of the variational problem has therefore the form:

$$f(\mathbf{x}) = \sum_{n=1}^{\infty} c_n \phi_n(\mathbf{x}) = \sum_{n=1}^{\infty}\sum_{i=1}^{l} a_i \lambda_n \phi_n(\mathbf{x}_i)\phi_n(\mathbf{x}) = \sum_{i=1}^{l} a_i K(\mathbf{x}, \mathbf{x}_i), \qquad (99)$$

where we have used the expansion (27). This shows that, independently of the form of $V$, as
long as it is differentiable, the solution of the regularization functional $H[f]$ is always a linear
superposition of kernel functions, one for each data point. The loss function $V$ affects the
computation of the coefficients $a_i$. In fact, plugging eq. (99) back in the definition of the $a_i$ we
obtain the following set of equations for the coefficients $a_i$:

$$a_i = C V'\left(y_i, \sum_{j=1}^{l} K_{ij} a_j\right), \qquad i = 1, \ldots, l$$

where we have defined $K_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)$. In the case in which $V(\cdot,\cdot) = (\cdot - \cdot)^2$ we obtain the
classical regularization theory solution (see Girosi, Jones and Poggio, 1995 for an alternative
derivation):

$$(K + \gamma I)\,\mathbf{a} = \mathbf{y},$$
where $\gamma$ is a constant inversely proportional to $C$.
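
For the quadratic loss the coefficients are obtained from a single linear system. The sketch below solves $(K + \gamma I)\mathbf{a} = \mathbf{y}$ and evaluates the resulting kernel expansion of eq. (99). The Gaussian kernel, its width, the toy data and the function names are assumptions made for illustration, and $\gamma$ is treated as a free parameter.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma):
    # K(x, y) = exp(-||x - y||^2 / (2 sigma^2)) for the rows of X and Y.
    d2 = (np.sum(X ** 2, axis=1)[:, None] + np.sum(Y ** 2, axis=1)[None, :]
          - 2.0 * X @ Y.T)
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

def fit_rn(X, y, gamma, sigma):
    # Coefficients a of the regularization network: (K + gamma I) a = y.
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + gamma * np.eye(len(X)), y)

def predict_rn(X_train, a, X_new, sigma):
    # f(x) = sum_i a_i K(x, x_i), the kernel expansion of eq. (99).
    return gaussian_kernel(X_new, X_train, sigma) @ a

X_train = np.linspace(0.0, 1.0, 6)[:, None]
y_train = np.sin(2.0 * np.pi * X_train[:, 0])
a = fit_rn(X_train, y_train, gamma=1e-8, sigma=0.2)
f_train = predict_rn(X_train, a, X_train, sigma=0.2)
```

With a very small $\gamma$ the network nearly interpolates the training data; larger values of $\gamma$ trade data fit for a smaller norm $\mathbf{a}^T K \mathbf{a}$.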

D Relation between SVMC and SVMR 

We want to study problem 5.1 in the classification case ($y_i \in \{-1, 1\}$ $\forall i$). Note that when
performing SVMR on $\{-1, 1\}$-valued data, if $\epsilon \ge 1$, the optimal solution to problem 5.1 is
$f = 0$, $\xi = \xi^* = 0$. Therefore, we restrict our attention to cases where $\epsilon < 1$.
We start by making the following variable substitution: 

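$$\eta_i = \begin{cases} \xi_i^* & \text{if } y_i = 1 \\ \xi_i & \text{if } y_i = -1 \end{cases}, \qquad \eta_i^* = \begin{cases} \xi_i & \text{if } y_i = 1 \\ \xi_i^* & \text{if } y_i = -1 \end{cases} \qquad (100)$$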

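Combining this substitution with our knowledge that $y_i \in \{-1, 1\}$ yields the following modification
of problem 5.1:

Problem D.1

$$\min_{f, \eta, \eta^*} \Phi_C(f, \eta, \eta^*) = \frac{C}{l}\sum_{i=1}^{l} (\eta_i + \eta_i^*) + \frac{1}{2}\|f\|_K^2$$

subject to the constraints:

$$y_i f(\mathbf{x}_i) \ge 1 - \epsilon - \eta_i, \qquad i = 1, \ldots, l$$
$$y_i f(\mathbf{x}_i) \le 1 + \epsilon + \eta_i^*, \qquad i = 1, \ldots, l \qquad (101)$$
$$\eta_i, \eta_i^* \ge 0, \qquad i = 1, \ldots, l.$$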

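Continuing, we divide both sides of each constraint in problem D.1 by $1 - \epsilon$, and make the
variable substitutions $f' = \frac{f}{1-\epsilon}$, $\eta' = \frac{\eta}{1-\epsilon}$, $\eta'^* = \frac{\eta^*}{1-\epsilon}$:

Problem D.2

$$\min_{f', \eta', \eta'^*} \Phi_{\frac{C}{1-\epsilon}}(f', \eta', \eta'^*) = \frac{C}{l(1-\epsilon)}\sum_{i=1}^{l} (\eta'_i + \eta'^*_i) + \frac{1}{2}\|f'\|_K^2 \qquad (102)$$

subject to the constraints:

$$y_i f'(\mathbf{x}_i) \ge 1 - \eta'_i, \qquad i = 1, \ldots, l$$
$$y_i f'(\mathbf{x}_i) \le \frac{1+\epsilon}{1-\epsilon} + \eta'^*_i, \qquad i = 1, \ldots, l \qquad (103)$$
$$\eta'_i, \eta'^*_i \ge 0, \qquad i = 1, \ldots, l.$$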

Notice that Problem D.2 looks very similar to the SVMC problem 5.4, the only difference being
the additional constraint in problem D.2 associated with the variable $\eta'^*$. Through an
analysis of the KKT conditions of problem D.2, it is easy to see that if $f, \xi$ solves problem 5.4
with parameter $C$, then under the additional condition that $\epsilon \in [a, 1)$, $f' = f$, $\eta'_i = \xi_i$, $\eta'^*_i = 0$ solves
problem D.2 with parameter $C(1+\epsilon)$. Then $(1-\epsilon)f$ is the solution of problem 5.1 with parameter
$C(1+\epsilon)$. This result can be applied as well to formulation (65): if $f, \xi$ solves problem 5.4
with parameter $\lambda$ and $\epsilon \in [a, 1)$, then $(1-\epsilon)f$ solves problem D.2 with a correspondingly rescaled parameter. The
constant a under which the relation is true can be related to the radius R of the smallest sphere 
containing all the data points and to the norm of the solution to the SVMC problem 5.4. See 
[76] for further details and a complete proof of the results reported here. 

E Proof of Theorem 6.2

Below we always assume that the data $X$ are within a sphere of radius $R$ in the feature space defined
by the kernel $K$ of the RKHS. Without loss of generality, we also assume that $y$ is bounded
between $-1$ and $1$. Let us first consider the case of the $L_1$ loss function. Let $B$ be the upper
bound on the loss function (which always exists under our assumptions). From definition 2.7 we
can decompose the rules for separating points as follows:

class $1$ if: $y_i - f(\mathbf{x}_i) \ge s + \gamma$
          or $f(\mathbf{x}_i) - y_i \ge s + \gamma$ (104)
class $-1$ if: $y_i - f(\mathbf{x}_i) \le s - \gamma$
          and $f(\mathbf{x}_i) - y_i \le s - \gamma$

for some $B - \gamma \ge s \ge \gamma$. Using this observation it is clear that for any $N$ points, the number of
separations we can get using rules (104) is not more than the number of separations we can get
using the product of two "indicator functions with margin":

function (a): class $-1$ if: $y_i - f_1(\mathbf{x}_i) \ge s_1 + \gamma$
              class $1$ if: $y_i - f_1(\mathbf{x}_i) \le s_1 - \gamma$ (105)
function (b): class $1$ if: $f_2(\mathbf{x}_i) - y_i \ge s_2 + \gamma$
              class $-1$ if: $f_2(\mathbf{x}_i) - y_i \le s_2 - \gamma$

where $f_1$ and $f_2$ are in $\mathcal{H}$, $B - \gamma \ge s_1, s_2 \ge \gamma$. For $s_1 = s_2 = s$ and for $f_1 = f_2 = f$ we recover
(104): for example, if $y - f(\mathbf{x}) \ge s + \gamma$ then indicator function (a) will give $-1$, indicator function
(b) will also give $-1$, so their product will give $+1$, which is what we get if we follow (104). So
since we give more freedom to $f_1, f_2, s_1, s_2$, clearly we can get more separations for any set of
points than we get using (104).

As discussed in section 2, for any $N$ points the number of separations is bounded by the growth
function. Moreover, for products of indicator functions it is known [96] that the growth function
is bounded by the product of the growth functions of the indicator functions. Furthermore,
the indicator functions in (105) are hyperplanes with margin in the $N + 1$ dimensional space of
vectors $\{\phi_n(\mathbf{x}), y\}$ where the radius of the data is $R^2 + 1$, the norm of the hyperplane is bounded
by $A^2 + 1$ (where in both cases we add 1 because of $y$), and the margin is at least $\frac{\gamma^2}{A^2+1}$.
The $V_\gamma$ dimension $h_\gamma$ of these hyperplanes with margin is known [96, 8] to be bounded by
$h_\gamma \le \min\left((N+1)+1, \frac{(R^2+1)(A^2+1)}{\gamma^2}\right)$. So the growth function of the separating rules (104) is bounded
by $G(l) \le \left(\frac{el}{h_\gamma}\right)^{h_\gamma}\left(\frac{el}{h_\gamma}\right)^{h_\gamma}$ whenever $l \ge h_\gamma$. If $h_\gamma^{reg}$ is the $V_\gamma$ dimension of the $L_1$ loss function, then
clearly $h_\gamma^{reg}$ cannot be larger than the largest number $l$ for which the inequality:

$$2^l \le \left(\frac{el}{h_\gamma}\right)^{h_\gamma}\left(\frac{el}{h_\gamma}\right)^{h_\gamma} \qquad (106)$$

holds. From this we get that $l \le 5h_\gamma$, therefore $h_\gamma^{reg} \le 5\min\left(N+2, \frac{(R^2+1)(A^2+1)}{\gamma^2}\right)$, which proves
the theorem for the case of $L_1$ loss functions.

The sketched proof can be extended to the general $L_p$ loss function and to Vapnik's ILF [36].

F The noise model of the data term in SVMR

We compute here the probability distributions $\lambda_\epsilon(t)$ and $\mu(\beta)$ (see equations (78) and (79))
solving equation (77).
From equation (77), computing the integral with respect to $\beta$ we obtain:

$$e^{-|x|_\epsilon} = \int_{-\infty}^{+\infty} dt\, \lambda(t)\, G(x - t) \qquad (107)$$

where we have defined:

$$G(t) = \int_{0}^{\infty} d\beta\, \mu(\beta)\sqrt{\beta}\, e^{-\beta t^2}. \qquad (108)$$

Observe that the function $G$ is a density distribution, because both the functions in the r.h.s. of
equation (108) are densities. In order to compute $G$ we observe that for $\epsilon = 0$ the function $e^{-|x|_\epsilon}$
becomes the Laplace distribution. In this case we can simply set $\lambda_{\epsilon=0}(t) = \delta(t)$ and obtain $G(t)$
from equation (107):

$$G(t) = e^{-|t|}. \qquad (109)$$

We can then compute the probability distribution $\mu$ by inverting equation (108). This requires
computing the inverse Laplace transform of $e^{-|t|}$. We obtain:

$$\mu(\beta) = \beta^{-2} e^{-\frac{1}{4\beta}}. \qquad (110)$$

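It remains to obtain the expression of $\lambda_\epsilon(t)$ for $\epsilon > 0$. To this purpose we write equation (107) in
Fourier space:

$$F[e^{-|x|_\epsilon}] = \tilde{G}(\omega)\,\tilde{\lambda}_\epsilon(\omega) \qquad (111)$$

with:

$$F[e^{-|x|_\epsilon}] = \frac{\sin(\epsilon\omega) + \omega\cos(\epsilon\omega)}{\omega(1 + \omega^2)} \qquad (112)$$

and:

$$\tilde{G}(\omega) = \frac{1}{1 + \omega^2}. \qquad (113)$$

Plugging equations (112) and (113) in equation (111) we obtain:

$$\tilde{\lambda}_\epsilon(\omega) = \frac{\sin \epsilon\omega}{\omega} + \cos \epsilon\omega. \qquad (114)$$

Finally, taking the inverse Fourier Transform and normalizing we obtain equation (78). For more
details see [75].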
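
The result can be checked numerically in the original domain. Inverting (114) term by term gives, up to normalization, a uniform density on $[-\epsilon, \epsilon]$ plus two point masses at $\pm\epsilon$; this explicit form is our reading of the inverse transform and is stated here as an assumption, since equation (78) is not reproduced in this appendix. The sketch below convolves that mixture with $G(t) = e^{-|t|}$ and compares the result with $2e^{-|x|_\epsilon}$ (the factor 2 is the omitted normalization); the function names and the grid size are hypothetical choices.

```python
import numpy as np

def eps_insensitive(x, eps):
    # |x|_eps = max(|x| - eps, 0)
    return np.maximum(np.abs(x) - eps, 0.0)

def mixture_of_laplacians(x, eps, n_grid=20001):
    # Integral of lambda(t) * exp(-|x - t|) dt, with unnormalized lambda(t)
    # equal to a box on [-eps, eps] plus Dirac masses at t = -eps and t = +eps.
    t = np.linspace(-eps, eps, n_grid)
    g = np.exp(-np.abs(x - t))
    box_part = np.sum((g[1:] + g[:-1]) * np.diff(t)) / 2.0   # trapezoid rule
    delta_part = np.exp(-abs(x - eps)) + np.exp(-abs(x + eps))
    return box_part + delta_part

eps = 0.5
xs = [-3.0, -0.3, 0.0, 0.2, 1.5]
lhs = np.array([2.0 * np.exp(-eps_insensitive(x, eps)) for x in xs])
rhs = np.array([mixture_of_laplacians(x, eps) for x in xs])
```

The two arrays agree to within the quadrature error, both inside the $\epsilon$-tube, where the value is constant, and outside it, where it decays like a Laplace density.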